PREFACE


This document is the final technical report (CDRL Item A003) for the
Software Reliability Study, Contract No. F30602-74-C-0036. It presents
results of a study of data, principally error data, collected from four
software development projects. These data were analyzed to determine what
might be learned about various types of errors in the software; the
effectiveness of the development and test strategies in preventing and
detecting errors, respectively; and the reliability of the software itself.

This report also describes data that are generally available, how
these data were used in this study, and some observed realities concerning
the data collection and analysis processes.

Finally, the most recent work on TRW's Mathematical Theory of Software
Reliability (MTSR), the Nelson model, is presented. This is complemented
by a survey of software reliability models currently available in the
software community.

Principal  Contributors: 

G.  R.  Craig  E.  C.  Nelson  B.  B.  White 

L.  E.  Frey  T.  A.  Thayer 

W.  L.  Hetrick  d,  A.  Yoxtheimer 

M.  Lipow  J.  A.  Whited


ACKNOWLEDGEMENTS 


Software  development  projects  have  a potential  for  creating  tremendous 
amounts  of  data,  not  all  of  which  are  recognized  as  being  valuable  at  the 
outset  of  a study  such  as  this.  Further,  analysis  of  the  data  is  best 
accomplished  with  assistance  from  those  who  created  the  data;  they  alone 
are  able  to  interpret  some  of  the  data. 

For their genuine interest and efforts in providing data and helping
with the analysis, the authors would like to thank the following people.


D.  H.  Barakat 
M.  G.  Calhoun 

C.  D.  Calvin 
T.  P.  Dillon 
K.  F.  Fischer 
H.  W.  Hawthorne 

D.  E.  Heine 

F.  S.  Ingrassia 


A.  S.  Liddle 

R.  N.  Schreiner 

B.  R.  Seefeldt 
L.  F.  Summerill 
P.  N.  Taylor 

t.  R.  Vallembois 

C.  E.  White 


EVALUATION 


The need for more 'reliable software' by all major commands, and the need for
research and development in software reliability, specifically to investigate
the sources of errors in software, were stated in the Command, Control
Information Processing CCIP-85 Study (Information Processing/Data Automation
Implications of Air Force Command and Control Requirements in the 1980's).

In attempts to satisfy these needs, various approaches have been proposed in
recent years; some are partially applicable, others are directed towards a
specific problem.

This effort was initiated in response to the CCIP-85 Study and fits into the
goals of RADC TPO No. 11, Software Sciences Technology, in particular the
area of Software Quality (Error Data Analysis). The report focuses on the
problems of software error identification, collection, categorization and
analysis as well as the tools and techniques utilized. The importance of
the study is to bring together past and current experience in software
reliability in order to form a methodology which can be used to produce more
reliable software as well as an insight to improvements in the processes
involved.

Many questions concerning software reliability can be answered by collecting
and analyzing data. Becoming better informed about the types of errors one
may come across, the distinctive features of the software development process
that propagated them, and the particulars under which they arise is a
beginning toward being able to specify measures of software quality in a
quantitative manner. These measures can be used by both buyers and producers
of large software systems.

JAMES V. CELLINI, JR.

1.0 INTRODUCTION

1.1 Study Objectives


The  objectives  of  the  Software  Reliability  Study  may  be  summarized 
by  the  following: 

• Determine  what  software  structural  and  development 
characteristics  are  available  for  analysis  and 
which  of  these  characteristics  are  relevant  to  the 
description (or prediction) of software reliability.

• Define  improved  methods  for  collecting  reliability 
data. 

• Based on error histories seen in the data, define
sets of error categories, both causative and symptomatic,
to be applied in the analysis of software
problem reports and their closures.

• Recommend  changes  in:  1)  development  techniques  to 
enhance  the  error-freeness  (reliability)  of  the 
coded  product  and  2)  test  techniques  to  make  it 
possible  to  find  more  errors  earlier. 

• Perform  a survey  of  existing  software  reliability 
models. 

• Extend  Nelson's  Mathematical  Theory  of  Software 
Reliability  (MTSR)  and  apply  it  to  data  collected 
on  an  ongoing  software  development  project. 

These objectives are achieved principally through analysis of empirical
data collected during software testing, although as the reader will note
in subsequent sections, what we see in testing is largely determined by
what occurs in the preceding phases of the software development cycle, i.e.,
during the requirements analysis, design, and coding phases. Therefore,
when necessary (and possible) this study has examined other software
development disciplines, too.

1.2  Background 

The driving force behind this study is the idea that much can be
said about the quality and reliability of software from the software's
error history. This idea is a popular one and one which serves as a
central theme in this study.

Although the term "software reliability" is defined in Sections 5.0
and 6.0 in conjunction with software reliability models, a more general
definition for purposes of the analysis of empirical error data is in
order. The definition chosen is the one offered in Reference 1.

Software  possesses  reliability  to  the  extent  that  it 
can  be  expected  to  perform  its  intended  functions 
satisfactorily. 

Assuming  that  documented  errors  represent  an  inability  to  perform  intended 
functions  satisfactorily,  error-free  software  would  be  reliable  software. 
This  assumption  is  made  in  the  investigation  of  errors  and  other  software 
characteristics. 

In the definition above, the word "satisfactorily" also needs definition.
A universally applicable quantitative definition does not presently
exist; however, in the world of software problem reports mere creation of
the report registers some dissatisfaction with performance, whether the
report documents a real problem, a need for software enhancement, or what
turns out to be no problem at all.

Using the definition given above, the number of errors documented on
problem reports can be used as an indicator of software reliability.
However, raw counts of errors don't tell the whole story and can be misleading.
Therefore, further analysis is necessary to determine the types of errors,
when and where they were introduced, how they were detected, and their
impact on the operation of the software system.
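
The difference between a raw count and a breakdown of the same reports can be sketched in a few lines. This is only an illustration in Python, not part of the study: the record fields (`category`, `phase_detected`) and every sample value are hypothetical.

```python
from collections import Counter

# Hypothetical problem-report records. Field names and values are
# illustrative assumptions, not data from the study.
reports = [
    {"id": 1, "category": "computational", "phase_detected": "validation"},
    {"id": 2, "category": "logic", "phase_detected": "validation"},
    {"id": 3, "category": "logic", "phase_detected": "acceptance"},
    {"id": 4, "category": "data handling", "phase_detected": "integration"},
]


def breakdown(records, field):
    """Count records by the value of one field."""
    return Counter(r[field] for r in records)


raw_count = len(reports)                       # a single number
by_category = breakdown(reports, "category")   # what kinds of errors
by_phase = breakdown(reports, "phase_detected")  # how they were detected
```

The raw count says only that four reports exist; the two breakdowns show which kinds of errors dominate and which test phase found them.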

1.3  Approach 

The fundamental approach taken in this study has been to base analysis
on real data gathered from four large software projects: two command
and control systems, a data management system, and a highly analytical
real-time system. An attempt has been made to quantify the various
characteristics of the software, the development program that produced the
software, the test program that detected the errors, and finally, the
errors themselves.

The  study  was  conducted  in  two  parts.  The  first  part  consisted  of 
collecting  and  analyzing  data  from  two  completed  projects  (three  projects 
were  actually  used)  to  get  a better  picture  of  data  availability  in  order 
to  determine  what  parameters  were  generally  available  and  meaningful  to 
analyze.  Analysis  techniques  were  also  developed  during  the  first  part 
of  the  study.  The  second  part  of  the  study  called  for  an  application  of 
findings  and  techniques  identified  during  the  first  part  of  the  study  to 
a fourth,  ongoing  project. 

Also in the last part of the study work was done to expand and
evaluate the Mathematical Theory of Software Reliability (MTSR). This
software  reliability  model  is  particularly  noteworthy  since  it  addresses 
data  and  the  functional  characteristics  of  the  software  in  its  assessment 
of  reliability. 

Details of specific approaches taken in the analysis of data and
expansion of MTSR are given throughout the remainder of this document.

1.4 Terminology


Although  a glossary  of  terms  is  presented  in  Appendix  A,  several 
general  terms  need  definition  to  familiarize  the  reader  with  what  follows 
in this document. These are terms used by TRW and some of its customers,
and as such, may not be commonly understood.

Project - The end product of a project is the software and its documentation;
the project itself is the combination of development activities,
personnel, material resources, schedules, etc., required to produce the
software and its documentation. In this study we talk about four projects,
and since they have provided data for analysis, we call them source
projects.

Programmer  - For  the  projects  under  study  a programmer  is  the  individual 
who  creates  the  software  product  and  documentation.  Sometimes  called  a 
developer  or  programmer/analyst,  he  designs,  codes,  and  tests  the  portion 
of  the  end  product  for  which  he  is  responsible.  He  doesn't  just  code. 

Tester  - In  this  study  a tester,  or  test  analyst,  is  an  individual  who 
accepts  delivery  of  the  software  from  the  developers  and  executes  it  to 
verify  that  it  performs  its  intended  functions  correctly.  He  prepares 
test  plans  and  procedures,  executes  the  test  procedures,  analyzes  test 
results,  and  documents  problems.  He  is  independent  of  the  developers  in 
the  sense  that  he  does  not  create  the  product.  He  is,  however,  required 
to  work  with  the  developers  to  verify  that  corrections  to  the  software  work 
properly through retesting. The term tester is also given to the system
integration contractor personnel and customer people involved in test
activities.

Problem/Error - A problem is a user oriented registration of dissatisfaction
which is documented on a form we generically call a problem report.
It can be either symptomatic or causative and need not be the result of
an execution of the software (e.g., code inspections can result in the
detection of problems). In this report we use the terms problem and error
synonymously, although we recognize that they should be different. This
is a concession made necessary by the fact that we have used real data from
large projects, and it has not been possible to retrace each problem report
to determine the original error.

Ideally, an error would be defined as a causative act producing a
fault* in the software which, if evoked, would result in a symptomatic
execution failure.

The assumption made in this study is that one problem report documents
one error or, more accurately, one fault. Of course, this is not always the
case. We therefore work with "actual" problems or "actual" errors, i.e.,
those problems that required a change in the code to effect corrective
action. By considering only the code change problems and performing analyses
at the routine level, the inaccuracies of the above assumption are
minimized.
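
A minimal sketch of this filtering step is given below, assuming a hypothetical record layout in which each closed problem report names the routine it was written against and how it was closed. The field names, closure labels, and sample records are illustrative assumptions and are not taken from the source projects.

```python
from collections import defaultdict

# Hypothetical closed problem reports. Only closures that required a code
# change are treated as "actual" errors in this sketch.
reports = [
    {"spr": "SPR-001", "routine": "ROUTINE_A", "closure": "code_change"},
    {"spr": "SPR-002", "routine": "ROUTINE_A", "closure": "explanation"},
    {"spr": "SPR-003", "routine": "ROUTINE_B", "closure": "code_change"},
    {"spr": "SPR-004", "routine": "ROUTINE_A", "closure": "code_change"},
    {"spr": "SPR-005", "routine": "ROUTINE_C", "closure": "document_change"},
]


def actual_errors_by_routine(records):
    """Keep only code-change closures and count them per routine."""
    counts = defaultdict(int)
    for record in records:
        if record["closure"] == "code_change":
            counts[record["routine"]] += 1
    return dict(counts)


errors = actual_errors_by_routine(reports)
```

Five reports reduce to three "actual" errors spread over two routines; the explanatory and document closures drop out of the routine-level analysis.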


*Some people call these "bugs."

1.5 Suggestions on Reading This Document

This document presents the results of the Software Reliability Study
in seven major sections. As might be imagined, there is some diversity
between the essentially theoretical work concerning software reliability
models and the work connected with collecting and analyzing real data.
Also, since the Software Reliability Study had few precedents to follow,
results included many lessons learned about collecting data and methods
of analysis which should be valuable to those who attempt similar studies
in the future. Parts of this document will hold varying importance for
different readers. Therefore, this document has been organized in such a
way that topics of data description, data collection, results of analysis,
and reliability modeling appear in separate sections, and it is not
imperative that sections be read sequentially. Lessons learned and results
are presented as part of each section.

In our previous study of empirical data, work done in support of the
CCIP-85 study group [5], our recommendation to the reader who was tempted
to further analyze data included in the final report was "don't." This was
due to the quality of data, collected and analyzed retrospectively, and the
lack of supporting information to explain the error data. Here our suggestion
to the same reader is "go ahead." Although we haven't presented the sheer
volume of raw data that was in the previous study, the quality of data
presented in this document is much better, and supporting information to explain
trends and statistical "outliers" is available.

There is, however, one fairly significant caveat. Our experience in
this study has shown that there can be tremendous variabilities from one
project to the next, and for that matter, between different portions of the
same project. What is true for one project may not be true for the next,
even when the project performers are essentially the same people. So, the
reader is advised to apply findings presented here, and any he may conclude
from his own analyses, to other projects with caution.

The  following  paragraphs  briefly  discuss  contents  of  the  remainder  of 
this  document. 

Section  2.0  contains  a description  of  the  four  source  projects  and 
the  data  provided  by  these  projects.  Although  not  strictly  necessary 
reading,  it  provides  a background  for  Section  4.0,  which  contains  the 
results  of  analysis. 

Section  3.0  specifically  addresses  the  categorization  of  software 
errors.  It  discusses  a method  for  generating  categories  and  recommends 
methods  for  increasing  the  quality  of  data  collected.  It  also  offers  a 
brief  history  of  category  lists  used  in  TRW's  experience. 

Section 4.0 contains the results of analyses performed during the
study. An attempt has been made in this section to define the data,
conditions, qualifiers, and assumptions made concerning each investigation.

Section 5.0 presents a summary of software reliability models
presently available in the industry.

Section 6.0 presents work accomplished on the Mathematical Theory
of Software Reliability and, to the extent possible, work done in evaluating
this model.

Section  7.0  presents  observed  realities  and  lessons  learned  about 
data  collection. 

Finally, Section 8.0 summarizes major study conclusions.

2.0 DATA DESCRIPTION


In  this  section  we  will  describe  the  data  that  are  available  for 
study.  Data  were  taken  from  four  software  projects  which,  for  purposes 
of  anonymity,  have  been  assigned  names  other  than  their  real  project 
designators.  Table  2-1,  below,  lists  the  categories  of  data  which  are 
generally  available  from  software  development  projects  and  the  projects 
for  which  each  category  is  available.  Each  of  these  categories  will  be 
described  in  detail. 

2.1  General  Project  Descriptions 


The four subject projects all represent major software development
activities. One of them, Project 3, involves a co-contractor other
than TRW producing some subset of the total software package. Specific
findings traceable to the interface or differing development techniques
are being analyzed, recognizing characteristics of the multi-contractor
relationship. Although this multi-contractor relationship is believed
to be extremely important to the analysis, review of the data includes
a "one contractor" survey as well.


Table 2-1. Data Availability

Columns: Data Category, Project 2, Project 3, Project 4

Data Category

1) General Project Descriptions
2) Design Problem Data
3) Problem Report (Error) Data
4) Software Characteristics
5) Testing Data
6) Personnel Data
7) Computer Usage Data

*Detail of software characteristics for these projects is comparatively
less than that which is available for Projects 2 and 3 due
to the unavailability of automated tools.

2.1.1 Project 2

Project 2* is the Project B mentioned in Reference 5. Data available
for this project were collected during four major modifications to
the Final Operating Configuration (FOC) version of that software. These
modifications are identified as MOD1A, MOD1B, MOD1BR, and MOD2 and are
considered generally as separate development packages (in fact, separate
projects), each with its own cycle of design, coding, debug, and formal
test activities.

The software itself is a command and control software system written
in JOVIAL J4. Only the applications software is considered in this study,
although error data for the operating system (SYMON) and its system support
software was sought in the course of data collection. OS software was
virtually error free for this project and Project 3.

A majority of the errors for Project 2 was detected during formal
test cycles composed of validation, acceptance, and integration testing
(See Section 4.2.2). As with the FOC version, validation and acceptance
tests were conducted by TRW, while a separate contractor performed the
system integration tests. No operational demonstrations were conducted
prior to delivery of the software to the customer, however.

Software development for each modification was governed by formally
specified and approved requirements, and it was these requirements that
formal validation and acceptance tests were designed to demonstrate.
System integration testing, on the other hand, was directed at verifying
compatibility of the applications software with the operating system
environment.

Structurally, the smallest compilable unit of source code was the
routine. Routines were joined to form functions, functions were joined
to form subsystems, and finally the system was composed of several
subsystems. This structure, which was also maintained for Project 3, was
produced by a project organization based on the function. That is, a
work unit (a group of developers or testers) was responsible for one or
more whole functions, a relation that has proved helpful in the analysis.
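
Because work units map onto whole functions, routine-level error counts can be rolled up the routine, function, subsystem hierarchy. The sketch below illustrates one way to do that in Python; the structure, the names (`SUBSYS_1`, `FUNC_A`, `R1`, and so on), and the counts are hypothetical and do not describe Project 2 or Project 3.

```python
# Hypothetical structure: system -> subsystems -> functions -> routines.
system = {
    "SUBSYS_1": {"FUNC_A": ["R1", "R2"], "FUNC_B": ["R3"]},
    "SUBSYS_2": {"FUNC_C": ["R4", "R5"]},
}

# Hypothetical error counts per routine (the smallest compilable unit).
routine_errors = {"R1": 3, "R2": 0, "R3": 1, "R4": 2, "R5": 4}


def errors_by_function(structure, counts):
    """Sum routine-level counts for each function."""
    return {
        func: sum(counts.get(r, 0) for r in routines)
        for functions in structure.values()
        for func, routines in functions.items()
    }


def errors_by_subsystem(structure, counts):
    """Sum routine-level counts for each subsystem."""
    return {
        sub: sum(
            counts.get(r, 0)
            for routines in functions.values()
            for r in routines
        )
        for sub, functions in structure.items()
    }


function_totals = errors_by_function(system, routine_errors)
subsystem_totals = errors_by_subsystem(system, routine_errors)
```

Since each function belongs to one work unit, the function totals can also be read as totals per responsible group.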

2.1.2  Project  3 

Project 3 represents an initial delivery of a large command and
control software package. The applications software is written in JOVIAL J4
and is compatible with the SYMON operating system.

As with Project 2, the majority of errors analyzed were detected
during formal testing; however, operational data for Project 3 spanning
a period of approximately one year was also analyzed. In other respects
the development and test characteristics are quite similar to those for
Project 2 (see Table 2-2). As was mentioned earlier, TRW was not the sole
contractor writing applications software, but data resulting from both
contractors was available through a common configuration management
source. Some data, e.g., personnel characteristics and computer time
usage, are available only for the TRW portion of the software.

*Project 1 corresponds to Project A mentioned in Reference 5.
It is not referenced in this study.

Table 2-2. General Project Characteristics

                        Project 2       Project 3       Project 4         Project 5

Size (total source      96,931          115,346         *                 11,105**
statements)                                                               17,459 MLI

Number of routines      173             249             190               531

Language                JOVIAL J4       JOVIAL J4       PWS               FORTRAN and
                                                                          Assembly

Formal Requirements     To function     To function     To software       To routine
                        level           level           system level      level

Documentation           SSD Exhibit     SSD Exhibit     TRW Standards     MIL STD 490
Standard                61-47B          61-47B

Co-contractor           No              Yes (77)        No                No
(routines)

Operating Mode          Batch           Batch           On-line           Real time
                                                        or batch          and batch

Formal Testing (descending in order of occurrence)

   Project 2:         Validation; Acceptance; Integration
   Project 3:         Validation; Acceptance; Integration;
                      Operational Demonstration
   Projects 4 and 5:  Routine Development; Routine Integration;
                      Process Integration; Performance Evaluation;
                      Operational Demonstration

*PWS macros not measurable in comparable units.

**Project 5 size here is measured in executable source statements
for FORTRAN code and in machine language instructions for assembly
language code.


2.1.3 Project 4


Project 4 is a generalized information processing system written in
a specially designed macro language called Program Word Structure or PWS.
This system, with data storage, retrieval, and reporting capabilities, is
highly flexible in that it can easily be tailored to suit a user's
requirements through use of an English-like language.

The Project 4 structure begins with the subroutine as the smallest
compilable unit of code. These are nestable and can call other subroutines
or themselves. Subroutines group to form modes and modes are grouped to
form decks. Decks are link edited to form pages.

It is important to note that all Project 4 data is representative
of  operational  software.  No  data  for  the  Project  4 development  period 
was  available. 

2.1.4  Project  5 


Project 5 is the ongoing software development project which was
selected to serve as a test bed for evaluation of early S.R.S. data
collection and analysis techniques. It was chosen because of its
schedule, the availability of development and test tools, and the fact
that it is producing state-of-the-art realtime software using top-down
structured programming. Software being produced on Project 5 includes
a realtime data processor (the applications software), a realtime
operating system, a realtime simulator to simulate the external hardware
and operational environments, and support software, both batch and
realtime, to support construction, testing, and configuration management
of the software system.

Not all of the Project 5 code was used for study, however. We
analyzed data from the applications, simulator, and operating system
software, all of which is realtime code, and the batch mode Product
Assurance tools, including a dynamic path analyzer, a code auditor,
and a code structure analyzer.

Development and test techniques being used on Project 5 make it
particularly attractive as an object of study. First of all, true
top-down incremental development is being followed. That is to say,
critical portions of the software are developed and tested in a series
of "increments" and less critical portions are simulated using stubs
(dummy elements). Each increment is a complete development cycle in
itself, consisting of design, coding, and testing phases. Each
successive increment adds more of the total system capability by
replacing stubs with deliverable code. In this manner risk can be
assessed early and critical algorithms analyzed thoroughly before large
portions of programs have been committed to code. The degree of breakage
or recoding on Project 5 is minimized in this manner and flexibility is
maintained until optimum design solutions are achieved. As the critical
algorithms and routines are shown to work successfully, lower levels of
subroutines are developed and the total system is integrated and
tested. Other projects in the study were developed in a one-time,
all-at-once traditional development cycle.
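
The idea of replacing stubs with deliverable code across increments can be sketched as follows. This is a toy Python illustration under stated assumptions, not Project 5 code: the component names (`track_filter`, `smoothing`, `report`) and the arithmetic are invented for the example.

```python
def stub(name, canned_value):
    """Build a dummy element that stands in for code not yet written."""
    def _stub(*args, **kwargs):
        return canned_value
    _stub.__name__ = name
    _stub.is_stub = True  # marker used to report what is still simulated
    return _stub


def track_filter(measurement, smoothing_fn):
    """Hypothetical 'critical algorithm', developed in the first increment."""
    return smoothing_fn(measurement) * 2


def stubs_remaining(components):
    """List the components that are still simulated by stubs."""
    return sorted(
        name for name, fn in components.items()
        if getattr(fn, "is_stub", False)
    )


# Increment 1: the critical routine is real, the rest are stubs.
components = {
    "track_filter": track_filter,
    "smoothing": stub("smoothing", 1.0),
    "report": stub("report", "OK"),
}
result_inc1 = components["track_filter"](10.0, components["smoothing"])
remaining_inc1 = stubs_remaining(components)


# Increment 2: one stub is replaced by deliverable code.
def smoothing(measurement):
    return measurement / 2


components["smoothing"] = smoothing
result_inc2 = components["track_filter"](10.0, components["smoothing"])
remaining_inc2 = stubs_remaining(components)
```

The critical routine is exercised in both increments; only the element it depends on changes from a canned value to real logic, and the list of remaining stubs shrinks with each increment.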

Project 5 is also being developed under rigorously enforced standards
and procedures. The size and duration of the project, coupled with
inherent turnover of personnel, dictated that a consistent set of coding
standards be developed. During the contract definition phase a Software
Standards and Procedures Manual was developed. This manual contains
thirty-two (32) discrete standards dealing with the following major areas:

• Number  of  executable  statements 

• Flow  charting 

• Nomenclature  or  naming  conventions 

• Preface  and  inline  comments 

• Structured  Programming 

• Statement Labels

The  standards  and  procedures  are  enforced  by  a product  assurance 
group  utilizing  an  automated  test  tool  entitled  "Code  Auditor"  which 
operates  on  routines  and  programs  to  check  compliance  for  32  individual 
coding  instructions. 
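
An automated compliance check of this kind can be sketched in miniature. The example below is not the Code Auditor itself, whose rules are not given here; it assumes two hypothetical standards (a preface comment must open the routine, and the number of executable statements is capped), uses `#` as the comment marker, and counts every non-blank, non-comment line as one executable statement.

```python
# Hypothetical limit; the actual project standard is not stated in the text.
MAX_EXECUTABLE_STATEMENTS = 100


def audit(source_lines, max_statements=MAX_EXECUTABLE_STATEMENTS):
    """Return a list of violations of two assumed coding standards."""
    violations = []
    stripped = [line.strip() for line in source_lines]

    # Assumed standard 1: the routine opens with a preface comment.
    if not stripped or not stripped[0].startswith("#"):
        violations.append("missing preface comment")

    # Assumed standard 2: cap on executable statements. Simplification:
    # each non-blank, non-comment line counts as one statement.
    executable = [ln for ln in stripped if ln and not ln.startswith("#")]
    if len(executable) > max_statements:
        violations.append(
            f"too many executable statements "
            f"({len(executable)} > {max_statements})"
        )
    return violations


good = audit(["# preface: computes x", "x = 1"])
bad = audit(["x = 1", "y = 2", "z = 3"], max_statements=2)
```

A real auditor would parse the language rather than inspect lines, but the shape is the same: one check per standard, each producing a reportable violation.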

Project 5 is also attractive from a testing standpoint since
considerable emphasis is placed on unit or routine level testing. A project
standard calls for exercise of all executable statements and branch
points during unit testing, and for the realtime applications software
there is a requirement to execute all paths* at the unit level.
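
Executing all paths is a stronger requirement than exercising all branch points. The Python sketch below uses a toy routine with two independent decisions (an assumption made purely for illustration) to show that two tests can take every branch outcome while still covering only two of the four paths.

```python
from itertools import product


def routine(a, b):
    """Toy routine with two independent decisions, hence four paths."""
    trace = []
    if a > 0:
        trace.append("a+")
    else:
        trace.append("a-")
    if b > 0:
        trace.append("b+")
    else:
        trace.append("b-")
    return tuple(trace)


def branches_covered(tests):
    """Set of individual branch outcomes taken by a list of test inputs."""
    return {outcome for t in tests for outcome in routine(*t)}


def paths_covered(tests):
    """Set of complete paths taken by a list of test inputs."""
    return {routine(*t) for t in tests}


all_branches = {"a+", "a-", "b+", "b-"}
all_paths = set(product(("a+", "a-"), ("b+", "b-")))

branch_tests = [(1, 1), (-1, -1)]                  # every branch outcome
path_tests = [(1, 1), (1, -1), (-1, 1), (-1, -1)]  # every path
```

With n independent decisions the number of paths grows as 2 to the power n, which is one reason an all-paths requirement is practical mainly at the unit level.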

2.1.5  Summary  of  General  Project  Descriptions 

Individual project characteristics may best be displayed in tabular
form in relation to similar characteristics for the other subject projects
(see Table 2-2). When, in the course of analysis, project-specific
characteristics are germane to results, these will be identified and contrasted
with the similar (or dissimilar) characteristics from other projects.

2.2  Design  Problem  Data 

To date, very little has been done to analyze design problems
uncovered in the formal reviews of the preliminary and detailed design.
This is mentioned since these reviews, typically designated the PDR and
CDR, respectively, produce a significant amount of data critiquing the
basic design. This information is easily collected, but never used.

*Paths through loops are defined to be at least one traversal of the loop,
although more traversals typically occur.

In  fact,  the  design  problem  report  (DPR)  is  difficult  to  analyze 
in  its  present  form  because  it  is  written  without  benefit  of  guidelines, 
and  the  important  parameters  are  not  presently  defined  or  known.  This 
wealth  of  information  (Project  3 produced  5619  such  reports!),  some  of 
it  describing  acute  problems  resulting  from  the  design  process,  faces 
the  same  need  for  organization  that  the  software  problem  report  has 
required. 

The extent to which the Project 3 DPR's will be used is limited.
A raw count of total occurrences by routine is being used only to get a
feel for the amount of review each routine received. (See Section 4.5.4.)

2.3  Problem  Report  (Error)  Data 


The  software  problem  report  (SPR)  and  the  closure  report  (MTM)* 
form  the  backbone  of  data  for  this  study.  It  is  from  the  combination  of 
these  reports  that  the  error  category  is  determined.  For  each  of  the 
subject  projects  the  SPR/MTM  pair  provided  the  vehicles  for  opening  and 
closing  a problem. 

Every SPR required at least one MTM, whether the problem was a real
one or not. An SPR could be written against a routine, the data base,
the COMPOOL (in the case of Projects 2 and 3), a document, a test case,
or just to ask a question. It was also used to place product
improvements on the CM records for future software updates. From the SPR the
following may be determined:

• The  subject  of  the  problem  (routine,  data  base,  etc.) 

• When  it  was  discovered 

• The  test  case  that  pointed  out  the  problem 

• The  software  configuration  when  the  problem  was  found 

• A description  of  the  problem. 

The MTM closes the problem by 1) delivering a new modification to
the software, 2) delivering a change to the data base (values) or COMPOOL
(definitions), or 3) demonstrating that the problem is not a problem,
i.e., an explanatory closure. It also indicates whether the SPR was
written against the right subject. The key information on the MTM,
however, is the explanation of the problem and the fix. For Projects 4
and 5 the software problem report was used, as a single piece of paper,
to both open and close the problem. See Appendix B for sample problem
reports.

*Modification Transmittal Memorandum, Projects 2 and 3.
Project 4 and Project 5 used only one form, the SPR
and Discrepancy Report (DR), respectively.
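
The information carried by an SPR and its closing MTM can be pictured as a pair of record types. The Python sketch below is a hypothetical model built from the fields listed in this section; the class names, field names, and sample values are assumptions and do not reproduce the actual forms shown in Appendix B.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Closure(Enum):
    CODE_MODIFICATION = 1       # new modification to the software
    DATA_OR_COMPOOL_CHANGE = 2  # data base values or COMPOOL definitions
    EXPLANATION = 3             # the problem is shown not to be a problem


@dataclass
class MTM:
    closure: Closure
    explanation: str                  # explanation of the problem and the fix
    subject_was_correct: bool = True  # was the SPR written against the right subject?


@dataclass
class SPR:
    subject: str        # routine, data base, COMPOOL, document, test case, ...
    discovered: str     # when the problem was discovered
    test_case: str      # test case that pointed out the problem
    configuration: str  # software configuration when the problem was found
    description: str
    closures: List[MTM] = field(default_factory=list)

    def is_closed(self) -> bool:
        """Every SPR requires at least one MTM."""
        return len(self.closures) >= 1

    def required_code_change(self) -> bool:
        """True if any closure delivered a modification to the software."""
        return any(
            m.closure is Closure.CODE_MODIFICATION for m in self.closures
        )


# Hypothetical sample report.
spr = SPR("ROUTINE_A", "validation testing", "TC-12", "version 3",
          "output field overflows")
open_state = spr.is_closed()
spr.closures.append(MTM(Closure.CODE_MODIFICATION, "widened output field"))
```

Keeping the closure type explicit is what allows the later selection of code-change problems as "actual" errors.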

Totals  of  SPR's  analyzed  for  the  source  projects  are  presented 
throughout  this  document  for  each  individual  investigation  of  errors. 
This  is  because,  depending  on  the  investigation,  the  total  number  of 
documented  problem  reports  may  not  have  been  used. 

2.4  Software  Characteristics 


As might be imagined, the characteristics of the software come in
two forms: those that can be quantitatively measured and those that
require some subjective evaluation. In this study both are considered
important. Both are needed to explain errors. Both are needed to
understand the background against which the software is designed, coded,
and tested; this need carries into the operational environment, too.
Examples of both types are presented in Table 2-3.

2.4.1  Software  Structural  Characteristics 


Structural  characteristics  are  measurable.  They  quantify  the  soft- 
ware size,  interface  descriptions,  use  of  the  data  base,  and  use  of 
various  language  elements.  The  approach  taken  in  the  initial  planning 
of  the  S.R.S.  data  collection  task  was  to  provide  as  much  quantitative 
detail  as  possible.  Several  automated  tools  were  available  to  aid  in 
this  task. 

2.4.1.1  'TMETRIC Language Analyzer

'TMETRIC  is  a utility  routine  which  statically  analyzes  JOVIAL  J4 
source  code,  breaking  this  code  down  into  its  component  language  elements. 
This  analysis  is  done  at  the  routine  level.  Figure  2-1  presents  sample 
output  for  a routine  called  C0MP012.  The  principal  output  of  'TMETRIC 
is  the  total  size  in  source  code  statements,  the  number  of  logic  branches, 
and  the  distinction  between  executable  and  non-executable  statements. 
Comments  are  distinguished  from  statements. 

In  an  effort  to  tie  specific  types  of  problems  (Section  4.3)  to 
types  of  routines  and,  finally,  types  of  code  within  a routine,  four 
generic  types  of  executable  code  have  been  arbitrarily  defined. 

I/O - I/O refers to JOVIAL defined and SYSTEM defined input and
output statements. JOVIAL I/O statements include FORMIN, FORMOUT,
DECODE and ENCODE. SYSTEM DISC I/O includes 'SOAHA and its various
entrances. Examples of SYSTEM TAPE I/O are 'CWRITE, 'WEOF and
'REWIND.

COMPUTATIONAL  - These  are  statements  expressing  equations  contain- 
ing arithmetic  operators. 

Example:  AA  = BB*CC**2/DD  $ 
DATA HANDLING - These statements effect a simple data transfer
(equality) from one variable to another and are
distinguished from computational statements.

Examples:  XX=YY  $,  AA($BB+2,  DO$)  = *PR  $. 

LOGICAL - Logical statements establish branches in the code
and include the IF, IFEITH, ORIF, FOR and GOTO
SWITCH statements.
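The four generic code types can be illustrated with a small keyword-based classifier. This is a minimal Python sketch under stated assumptions: the matching rules (leading keyword, presence of an arithmetic operator outside a subscript) are our own simplification and are not the algorithm used by 'TMETRIC; the subscripted statement in the usage note is a hypothetical form modeled on the examples above.

```python
import re
from collections import Counter

# Keyword lists are taken from the definitions in the text.
IO_WORDS = {"FORMIN", "FORMOUT", "DECODE", "ENCODE",
            "'CWRITE", "'WEOF", "'REWIND"}
LOGICAL_WORDS = {"IF", "IFEITH", "ORIF", "FOR", "GOTO"}

HEAD = re.compile(r"'?[A-Z][A-Z0-9]*")   # leading word, optional prime
SUBSCRIPT = re.compile(r"\(\$.*?\$\)")   # ($ ... $) subscript brackets
ARITHMETIC = re.compile(r"[+\-*/]")      # any arithmetic operator


def classify(statement):
    """Assign one statement to a generic code type (simplified rules)."""
    text = statement.strip()
    match = HEAD.match(text)
    head = match.group(0) if match else ""
    if head in IO_WORDS:
        return "I/O"
    if head in LOGICAL_WORDS:
        return "LOGICAL"
    if "=" in text:
        # Arithmetic inside a subscript does not make the statement
        # computational, so subscripts are removed before the check.
        bare = SUBSCRIPT.sub("", text)
        if ARITHMETIC.search(bare):
            return "COMPUTATIONAL"
        return "DATA HANDLING"
    return "UNCLASSIFIED"


def code_type_percentages(statements):
    """Percentage of statements falling in each code type."""
    counts = Counter(classify(s) for s in statements)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kind: 100.0 * n / total for kind, n in counts.items()}
```

For example, classify("AA = BB*CC**2/DD $") yields COMPUTATIONAL and classify("XX=YY $") yields DATA HANDLING, while a hypothetical subscripted transfer such as "AA($BB+2$) = YY $" is still treated as DATA HANDLING because the only operator sits inside the subscript.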

Additional considerations in the 'TMETRIC output are as follows:

• EXECUTABLE statements are distinguished from NON-EXECUTABLE data
declarations, procedure declarations, etc.

• L1=, L2=, etc., refer to the number of IF, IFEITH, and ORIF
statements indentured 1 level, 2 levels, etc. LL includes levels of
indenture of 5 or greater.

• N1=, N2=, etc., refer to the number of FOR loop statements nested
1 deep, 2 deep, etc. NN includes depths of nesting of 5 or
greater.

• PROC CALL includes internal and external (SUBR) procedure calls.

• B1=, B2=, etc., refer to the number of SWITCH declarations that
contain 1 branch, 2 branches, etc. BB includes SWITCH branches
of 5 or greater.

• TOTAL  BRANCHES  includes  all  possible  logical  branches,  resulting 
from  IF,  IFEITH,  ORIF  and  GOTO-SWITCH  name  statements.  This  does 
not  reflect  the  actual  number  of  logical  branches  the  program 
will  make  when  it  executes. 
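The level counters described above are simple tallies with a single overflow bucket for 5 or greater. A minimal Python sketch follows; the function name and the input (a list of observed depths or branch counts) are assumptions for illustration and do not describe how 'TMETRIC itself gathers the values.

```python
from collections import Counter


def bucket_counts(values, prefix, overflow_at=5):
    """Tally values into labels such as L1, L2, ... plus one overflow label.

    prefix is "L" (IF indenture), "N" (FOR nesting) or "B" (SWITCH
    branches); values at or above overflow_at share the doubled label
    (LL, NN, BB), following the convention described in the text.
    """
    counts = Counter()
    for value in values:
        if value < 1:
            raise ValueError("levels and branch counts start at 1")
        label = prefix * 2 if value >= overflow_at else f"{prefix}{value}"
        counts[label] += 1
    return dict(counts)
```

Given hypothetical IF indenture levels 1, 1, 2, 3, 5 and 7, bucket_counts(..., "L") reports two statements at L1, one each at L2 and L3, and two in the LL overflow bucket.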

The  four  code  types  described  above  appear  to  be  general  enough  to 
be  applicable  to  Projects  4 and  5,  but  two  code  types  were  added  to  cover 
the  PWS  language  of  Project  4: 

• Information  and  program  format  control  statements 

• Instrumentation  statements  for  compilation  and  identification. 

The  macros  in  the  PWS  language  have  been  classified  according  to  the  four 
generic  types  and  the  two  additional  types  given  above.  Project  4 diag- 
nostics output  routine  usage  of  the  PWS  macros.  From  this  a breakdown 
of  code  type  by  routine  can  be  determined. 

2.4.1.2  Interface Definitions


Routine/routine and routine/data base interface descriptions are
generally available from system utility or construct routines. To this
can be added details of the individual interface, e.g., number of
arguments in the calling sequence, the type of interface (applications,
system, user, data base) and the format of information passed. The level
of embedment or the "distance from the executive" can be determined and
may give us a clue to why routines are used incorrectly or why some
routines suffer from a large number of inaccurately written SPR's.

2.4.2  Subjective  Characteristics 

These  characteristics  provide  the  background  necessary  for  under- 
standing the  software  problems.  Here  again  the  approach  was  to  collect 
any  information  that  wasn’t  nailed  down;  however,  in  the  course  of  defining 
common  questions  that  could  be  asked  of  each  developer,  the  term  "difficulty" 
kept  cropping  up.  When  asked  to  describe  a routine's  complexity  we  typi- 
cally describe  its  type  (e.g.,  a utility  routine)  and  quickly  launch  into 
a description  of  how  difficult  it  is,  was,  or  will  be  to  produce.  Ask  the 
same  question  of  a tester  or  maintainer  of  software  and  the  response  is 
likely  to  be  the  same.  The  point  to  be  made  here  is  that  we  don't  know 
how  to  describe  software  complexity.  We  do  have  a pretty  good  and  remark- 
ably universal  understanding  of  what  makes  a routine  "difficult."  There- 
fore, questions  related  to  subjective  routine  characteristics  were  limited 
to  asking  about  routine  type  and  difficulty.  Work  unit  managers  were 
requested  to  assign  difficulty  ratings  to  each  of  the  following  disciplines; 
they  were  also  asked  to  provide  additional  comment  concerning  the  routine 
and  the  difficulty  encountered  during  development  and  test: 

• Design - preliminary and detailed design work, the design reviews,
requirements definition, and all work done prior to coding.

• Coding  - the  process  of  transferring  the  flow  documented  in  the 
detailed  design  documentation  into  source  code. 

• Checkout  - debug,  checkout,  and  development  testing  of  the 
routine,  including  preparation  of  special  test  drivers  and 
debug  code. 

• Implementation - work necessary to make the routine interface
with other routines, the OS, and the COMPOOL; effort required
to provide data base inputs; the human interfaces with other
developers, the test group, configuration management, etc.

• Documentation  - preparation  of  all  documents  related  to  the 
routine,  excluding  status  reports. 

Routine  type  is  assigned  according  to  the  primary  function  of  the  routine. 
These  are  given  below. 

• CON - control or executive routine

• INP - input routine

• SET - setup or initialization routine

• P/C - primarily computational routine

• P/P - post-processing routine

• OUT - output routine

• UTL - utility routine

A similar rating of difficulty for Project 3 routines is available
from the pre-detailed design period. Comparison will be made to the
post-delivery difficulty rating to investigate differences and any possible
relation to the types of problems encountered.

Table 2-3.  Available Parameters

Routine Structural Characteristics

• Routine size
  - Total source code statements
  - Executable statements
  - Non-executable statements
  - Machine dependent number of instructions

• Number of branches

• Number of direct interfaces
  - With other applications routines
  - With operating system routines

• Number of arguments in interface calls

• Data interfaces
  - Number of global data blocks
  - Number of internal data variables

• Number of procedures

• Number of entry points

• Number of exit points

• Routine code type
  - % computational
  - % logical
  - % data handling
  - % I/O

• Loop and nesting levels

• Branch statement (IF) nesting levels

• Number of comments

• Pages of documentation

• Computer time (clock time, not CPU time)
  - Development time
  - Test time

Subjective Characteristics

• Routine difficulty at preliminary design

• Routine difficulty after formal test and delivery
  - Design
  - Code
  - Debug/checkout
  - Implementation
  - Documentation

• Routine type
  - Executive
  - Control
  - Setup
  - Input
  - Computational
  - Post processing
  - Output

• Personnel data
  - Number of people working on routine
  - Load factor on each programmer
  - Programmer rating
  - Programmer/job evaluation

2.5  Test Characteristics for Projects 2, 3 and 4

This  source  of  information,  believed  to  be  essential  to  the  study  of 
software  reliability,  has  proved  to  be  a major  disappointment  in  that  test 
data  of  the  required  detail  is  not  available  for  Projects  2,  3,  and  4. 
Although  information  about  numbers  of  tests,  the  type  of  test,  and  how  many 
problems  each  uncovered  is  available  and  will  be  examined  for  Project  3,  a 
quantitative  measure  of  how  much  of  the  software  was  tested  is  not  available. 
That  is,  we  cannot  determine  how  much  stress  the  test  program  put  the  soft- 
ware through.*  Aside  from  being  a disappointment  to  the  S.R.S.  this  lack 
of  information  is  really  a fundamental  criticism  of  the  test  program  since 
it  is  key  to  the  question  of  determining  how  much  testing  is  "enough." 

When  formal  testing  was  over  it  could  be  said  that  all  the  requirements 
were  satisfied,  a "large"  number  of  the  additional  software  capabilities 
were  demonstrated,  and  when  exercised  in  the  operational  environment,  the 
software correctly processed operational-like data within the allotted time
limit.  It  is  impossible,  however,  to  say  what  percentage  of  the  code  was 
exercised. 

Testing  for  Projects  2 and  3 was  conducted  in  five  distinct  test 
periods,  each  with  a general  test  goal.**  These  periods  are  described  in 
the  following  paragraphs. 

No  test  data  were  available  for  Project  4 since  it  is  an  operational 
program. 


*Measures of stress other than the amount of code exercised are
1) computer time used for testing, see Section 4.0, 2) number of
test executions, and 3) number of analysts available to review
test results. Item 2), although available, is considered a poor
parameter because test reruns generally did not execute new code.
Item 3) is also available and may help to explain both the number
and type of SPRs generated.

**Project  2 actually  employed  four  of  these,  eliminating  the  opera- 
tional demonstration  because  Project  2 is  operational  software 
under  frequent  update. 
2.5.1  Development Test

Test  cases,  formal  only  to  the  extent  that  they  were  documented,  were 
written  and  executed  by  the  development  personnel.  The  goal  during  this 
test  period  was  to  demonstrate  specific  functional  capabilities,  test  data 
extremes  and  singularities,  trigger  output  of  all  error  messages,  test  the 
operator  interface,  and  produce  all  output  formats  in  the  appropriate  media. 
Use  of  special  drivers,  debug  code,  and  instrumentation  techniques  was 
practiced.  This  testing  was  begun  after  a routine  was  compiled  and  debugged, 
i.e.,  the  routine  would  load  and  cycle.  Tests  were  structured  in  a bottom-up 
fashion; once routines were sufficiently tested, they were grouped and tested
as  functions.  Functions  were  grouped  for  the  subsystem  tests. 

2.5.2  Validation  Test  and  Acceptance  Test 

Validation  testing  was  begun  after  a batch  delivery  of  the  software  at 
an  event  called  "internal  delivery."  At  this  point  an  independent  test  group 
took  delivery  of  the  software  and  began  formal  testing  with  a goal  of  demon- 
strating the  approved  software  performance  and  design  requirements.  An  addi- 
tional goal  was  to  demonstrate  selected  software  capabilities  agreed  upon  by 
the  customer  and  the  test  group.  These  included  use  of  operationally  oriented 
unique  data  sets,  selected  limits  tests,  and  use  of  operational  scenarios. 

Key to the conduct of validation and acceptance testing was the fact that all
testing was performed on a master configuration and no alteration of the code
was allowed; even the addition of diagnostic code was not allowed.

Validation  and  acceptance  tests  were  run  at  the  subsystem  level  through 
the  main  subsystem  entrance  but  were  designed  to  examine  performance  at  the 
routine,  function,  subsystem,  and  system  levels. 

Acceptance  testing  consisted  of  a rerun  of  a subset  of  validation  tests, 
those  that  specifically  demonstrated  the  software  requirements.  Customer 
acceptance  of  the  software  hinged  on  successful  completion  of  these  tests. 

2.5.3  Integration  Test 


This  test  period  was  conducted  by  an  independent  integrating  contractor 
whose  responsibility  it  was  to  demonstrate  that  the  applications  software 
interfaced  correctly  with  the  operating  system  and  the  system  support  soft- 
ware. Additional  goals  were  to  test  the  following: 

• the software/operator interface

• all listable, card, and tape outputs

• out of bounds input conditions

Tests  conducted  during  this  period  were  similar  in  structure  and 
formality  to  the  validation  and  acceptance  tests. 




2.5.4  Operational Demonstration

The  operational  demonstration  (OD)  was  a short  period  of  testing 
which  followed  an  operational  timeline  and  used  an  operational  data 
base.  The  goal  during  this  period  was  to  demonstrate  the  software  in 
the  operational  environment. 

2.5.5  Project  5 Testing 

The Project 5 test program differs primarily in that the development
test  period  is  conducted  in  a more  formal  fashion.  More  specifically, 
the  goals  of  the  development  test  period  are  to  1)  demonstrate  require- 
ments allocated  to  the  routine  level,  2)  exercise  all  code,  and  3)  demon- 
strate an  extensive  list  of  functional  capabilities  in  formal  test 
cases.  Extensive  low  level  interface  testing  is  an  additional  goal. 
Development  tests  are  conducted  at  the  routine  and  task  (analogous  to 
the  function  of  Project  3)  levels,  calling  for  drivers  to  be  built  to 
do  this  early  integration  testing. 

Following this period is a process integration test period during
which formal tests are executed to demonstrate interfaces between tasks
making up subprocesses and between subprocesses making up the various
realtime processes. This test phase is conducted by an independent
test team and is the period which produces the problem reports, called
discrepancy reports, used in this study. This testing is actually
system level testing because the entire software system*, interfacing
with the realtime simulator, is exercised.

Although  no  measure  of  test  thoroughness  was  possible  at  the  process 
or  system  level  of  testing**, test  thoroughness  was  determinable  at  the 
routine  level  through  use  of  a dynamic  path  analysis  tool.  This  allowed 
development  testers  to  assure  that  all  executable  code  had  been  tested 
and  that,  in  a select  portion  of  the  system's  applications  software, 
all  paths  had  been  tested.  This  was  coupled  with  use  of  an  operational- 
like  test  data  base  during  early  routine  level  testing.  The  effective- 
ness of  this  test  thoroughness  will  be  discussed  in  Section  4.7.4. 
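The routine-level notion of thoroughness described here, that every executable statement has been exercised by at least one test, can be illustrated with a short Python sketch. It is not a description of the Project 5 dynamic path analysis tool; the function name and the representation of statements as integers are assumptions, and path coverage is not modeled.

```python
def statement_coverage(executable, executions):
    """Return (fraction executed, sorted list of statements never executed).

    executable: identifiers of a routine's executable statements.
    executions: one collection per test run, listing the statements
                that run actually executed.
    """
    executable = set(executable)
    if not executable:
        raise ValueError("routine has no executable statements")
    executed = set()
    for run in executions:
        # Ignore anything that is not an executable statement of this routine.
        executed |= set(run) & executable
    missing = sorted(executable - executed)
    return len(executed) / len(executable), missing
```

A development tester could treat a non-empty list of missing statements as a signal that further test cases are needed before the routine is released to the next test period.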


*To  the  extent  possible  with  each  successive  iteration  of  the 
top-down  approach. 

**Automated tools available for measuring test thoroughness ran
out of core at the system level.
2.6  Personnel Data

Evaluation of personnel performance was limited to TRW programmers
for Project 3 up to product delivery at the end of the operational
demonstration. Once again this is done because of the incompleteness of the
Project 3 data set, coupled with the subjective nature of the parameters.
Evaluation of test personnel was considered as a possible avenue of
investigation but was discarded due to our inability to quantitatively
evaluate individual test cases.* Incidentally, evaluation criteria proposed
below would work equally well for the test group.

Selection  of  parameters  was  based  on  several  observations. 

• Programmer  experience,  both  in  years  of  industry  experience  and 
years  of  experience  in  a specific  environment,  was  not  considered 
important. 

• Development schedules were tight, manpower sometimes limited,
and crises frequent.

• All  software  was  delivered  on  schedule  and  was  accepted  by  the 
customer. 

Given these observations, two general measures relating to programmer
performance were considered important: 1) programmer-specific criteria
and 2) assignment or job-specific criteria. And, since these measures
are not believed to be independent, a programmer/job evaluation parameter
was devised. Table 2-4 presents parameters considered in the study of
programmer characteristics relative to software quality. From these
parameters the following may be determined.

PROGRAMMER RATING = (KNOWLEDGE) + (INTELLIGENCE) + (INITIATIVE)
                    + (RESPONSIBILITY)

PROGRAMMER/JOB EVALUATION = (2.0 - LOAD FACTOR) x (PROGRAMMER RATING)


Evaluation  was  made  for  76  TRW  programmers  who  worked  directly 
on  and  were  responsible  for  the  Project  3 software.  This  evaluation  is 
being  done  by  the  programmer's  line  management  after  project  completion. 
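The two derived parameters can be worked through in a few lines of Python. The sketch below is illustrative: the class and function names are ours, and the score values and load factors in the usage note are hypothetical, since the scales of Table 2-4 are not reproduced in this section.

```python
from dataclasses import dataclass


@dataclass
class ProgrammerScores:
    # Scales are hypothetical; the report defines them in Table 2-4.
    knowledge: float
    intelligence: float
    initiative: float
    responsibility: float


def programmer_rating(scores):
    """PROGRAMMER RATING: the sum of the four subjective scores."""
    return (scores.knowledge + scores.intelligence
            + scores.initiative + scores.responsibility)


def job_evaluation(load_factor, rating):
    """PROGRAMMER/JOB EVALUATION = (2.0 - LOAD FACTOR) x (PROGRAMMER RATING)."""
    return (2.0 - load_factor) * rating
```

With hypothetical scores of 4, 3, 5 and 4 the rating is 16. A load factor of 0.5 then gives an evaluation of 24.0, while a load factor of 1.5 gives 8.0, showing how the formula discounts the rating of a heavily loaded programmer.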


*Note:  Evaluation  of  test  personnel  performance  is  a particularly  attrac- 
tive idea  since,  as  we  shall  see  in  Section  4.5.5,  the  programmers 
and  the  independent  test  group  for  Project  3 formed  two  distinct 
colonies  of  data. 
2.7  Computer Usage Data

Information  concerning  the  amount  of  computer  time  used  during  formal 
testing,  although  not  in  units  of  CPU  time,  is  available  for  Project  3. 
Clock  on  and  off  times  were  recorded  for  every  TRW  run  made  during  the 
development  and  test  periods.  Analysis  of  development  and  test  computer 
usage consistent with documented problems discovered during testing is
presented in Section 4.5.5.

The  subject  of  machine  time  and  the  number  of  problems  encountered 
during  that  time  is  cause  for  lengthy  and  often  heated  discussion.  In  any 
event, one thing is clear from this study: if we are to investigate the
matter,  we  must  have  accurate  records  of  wall  clock  and  CPU  times  and  be 
able  to  tie  them  to  specific  runs  of  the  software. 
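The kind of record keeping called for here can be sketched in Python. The run-log layout (test period, clock-on time, clock-off time, problems found) and the function names are assumptions made for illustration, and the timestamps in the usage note are invented; only wall clock time is handled because CPU time was not recorded for Project 3.

```python
from collections import defaultdict
from datetime import datetime


def elapsed_minutes(clock_on, clock_off):
    """Wall clock minutes between two ISO-format timestamps."""
    start = datetime.fromisoformat(clock_on)
    stop = datetime.fromisoformat(clock_off)
    if stop < start:
        raise ValueError("clock-off time precedes clock-on time")
    return (stop - start).total_seconds() / 60.0


def minutes_per_problem(runs):
    """Wall clock minutes per documented problem, by test period.

    runs: iterable of (period, clock_on, clock_off, problems_found).
    A period with no documented problems maps to None.
    """
    minutes = defaultdict(float)
    problems = defaultdict(int)
    for period, clock_on, clock_off, found in runs:
        minutes[period] += elapsed_minutes(clock_on, clock_off)
        problems[period] += found
    return {period: (minutes[period] / problems[period]
                     if problems[period] else None)
            for period in minutes}
```

For instance, two hypothetical development runs of 45 and 30 minutes that together produced three problem reports give 25.0 minutes per problem, while a 90 minute validation run with no reports yields None for that period.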
3.0  ERROR DATA CATEGORIZATION


This section discusses classification or categorization of software
errors. Category lists generated and used during performance of the Software
Reliability Study are the principal object of discussion, although general
comments and recommendations concerning error categorization are also
presented.

3.1  Error  Categories 

For purposes of this report an error category is defined as a generic
description of error type. The source materials used in categorizing
an error are the problem report and the closure report, both of which
are necessary for accurate error categorization. Reasons for such a
general definition will become apparent from the following sections,
which address three aspects of dealing with categorization of software
problems: 1) generation of categories, 2) assignment of categories to
problem reports, and 3) analysis of results.

3.1.1  Generation of Error Categories for Projects 2, 3, and 4

The approach taken in generating the error categories (failure
categories, error codes, or defect categories, as they are sometimes
called) was to base results on an analysis of the SPRs and MTMs rather
than to draw up a list a priori. This was done for two reasons: existing
lists tended to be rather long (the longest presenting over 400
categories), and more than one project was being examined,
each with its own characteristics. Problem reports and closures varied
in information content, too.

Since  Project  3 represents  the  largest  source  of  error  information, 
a pilot  study  was  conducted  on  a sample  of  800  SPRs  to  generate  a list 
of  error  categories.  The  sample  was  selected  from  the  mid-validation 
test  phase  of  Project  3 to  take  advantage  of  SPRs  which  were  written 
after  the  initial  test  personnel  learning  period.  Once  a list  was 
consolidated,  SPRs  from  Projects  2,  3,  and  4 were  categorized  according 
to  the  list.  Additions  and  refinements  to  this  list  were  encouraged, 
but  the  resulting  alterations  were  surprisingly  minimal.  Major  changes 
consisted  of  a regrouping  of  existing  categories.*  Twenty-six  new 
categories  were  added  as  a result  of  Project  4.  Thirteen  of  these  were 
added  because  Project  4 is  an  operational  system  with  many  user  requested 
changes,  and  the  remaining  13  were  added  to  provide  greater  detail  for 
existing categories or to establish Project 4-specific categories.

It should also be noted here that, although the major groupings are
open to argument, they were selected to be as compatible as possible
with techniques available for determining routine characteristics and
code type (Section 2.0).


*Changes to this list were forbidden after analysis of Project 5 error data
began.
Table 3-1 presents the 164 error categories under 20 major headings
or groups. Total numbers of occurrence for each category are also given
for each project and for the four modifications to Project 2 being
considered in this study. Detailed analysis is presented in Section 4.2.1.

Note that the level of detail provided by a category is indicated
by the level of indenture for the category number. Major headings,
e.g., AA000, are considered categories, but most problems were assigned
to categories at a greater level of indenture (detail).
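Because most problems are assigned to detailed categories, totals for a major group have to be obtained by rolling the detailed counts up to the major heading. The Python sketch below assumes that a category number consists of two letters followed by three digits and that the major heading for a group is the two-letter prefix followed by 000 (as in AA000); the function names and the counts in the usage note are hypothetical.

```python
import re
from collections import defaultdict

# Assumed format of a category number: two letters, three digits.
CATEGORY = re.compile(r"^([A-Z]{2})(\d{3})$")


def major_heading(code):
    """Return the major heading (e.g. AA000) for a category number."""
    match = CATEGORY.match(code)
    if not match:
        raise ValueError(f"not a category number: {code!r}")
    return match.group(1) + "000"


def roll_up(counts):
    """Sum occurrence counts of detailed categories by major heading."""
    totals = defaultdict(int)
    for code, occurrences in counts.items():
        totals[major_heading(code)] += occurrences
    return dict(totals)
```

With hypothetical counts of 3 for AA040, 1 for AA041 and 2 for AA050, roll_up reports 6 occurrences under AA000; counts recorded directly against a major heading are included in the same total.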

A brief  description  of  each  of  the  20  major  groups  is  given 
below.  Explanation  of  selected  categories  within  each  group  is  also 
provided,  where  necessary. 

3.1.1.1  Computation Errors

Computation  errors  were  errors  in  or  resulting  from  coded  equations. 
These  equations  fell  generally  into  two  categories,  those  that  produced 
values  directly  related  to  the  physical  problem  being  solved  by  the 
software  (algorithms,  vector  algebra,  modeling  code,  etc.)  and  equations 
used in a bookkeeping sense (computation of indices, record numbers,
entry  numbers,  etc.).  Categories  worthy  of  special  mention  are  as 
follows: 

AA040 - This category covers coded equations which were of the
wrong form, missing terms, or based on the wrong physical
convention. A special case in this category is AA041
which indicates an inaccuracy in a mathematical model.

AA050 - This category was assigned to errors found through
comparison of output to results from manual calculations
or previously validated sources. It is a highly
symptomatic category resulting from non-specific SPR and
MTM documentation.

AA070 - Time calculation errors were peculiar to Projects 2 and
3 and occurred frequently enough to generate three
categories. In general, these were modeling problems.

3.1.1.2  Logic Errors

In generating the logic error categories an attempt was made to
tie errors to existing logical code or the need for logical code.
Categories in this group tend to be highly symptomatic.
BB060, BB070 - Missing logic or condition tests was a major problem. These
were errors resulting from a lack of code to perform some
logical function; specific sub-categories involved failure
to check indices, number of entries, flags, and singularity
values of data items.

BB100 - This category was added for Project 4 and is the equivalent
of BB120 for Projects 2 and 3. In effect these categories
point to errors in the logical code but, as a result of
problem documentation, provide no further detail. The
categories remain separate because of Project 4's operational
status, while BB120 errors were uncovered as a result of
formal pre-delivery testing.

BB140 - Projects 2 and 3 both dealt with data items which were set
initially and updated periodically. Status of settings and
the propagation of values was monitored. Logic designed
to handle this was error prone.

BB150 - This category represents logical modeling problems (to be
distinguished from computational modeling problems).

BB160, BB170 - These categories criticize the logic design. It is important
to note that this criticism comes from review after the
logic has been coded, not from a design review based on
documentation only.

BB180 - This category, a storage reference error, deserves special
mention since it was a software error which produced results
looking like a hardware error. It is the only category
of this nature in the list.


3.1.1.3  I/O Errors

An  attempt  was  made  to  limit  categories  in  this  group  to  errors 
resulting  from  I/O  code  and  to  distinguish  them  from  interface  or  other 
error  categories.  This  turned  out  to  be  very  difficult  since  the  physical 
manifestations  that  are  usually  labeled  I/O  are  generally  symptomatic  of 
other  errors.  For  example,  CC020  (output  missing  data  entries)  could 
be  symptomatic  of  a loop  processing  or  logic  error.  The  goal,  however, 
was  to  establish  categories  relating  to  output  format,  position, 
completeness,  field  size,  and  control.  Categories  in  this  group  are 
believed  to  be  self-explanatory. 
3.1.1.4  Data Handling Errors

This group of categories covers errors made in reading, writing,
moving, storing, and modifying data. An attempt was made here to limit
these errors to those occurring wholly within a routine and not those
occurring across an interface. Categories needing additional description
are as follows:

DD070 - Bit manipulation errors in Project 3 were common. These
were generally encountered when a packed data item required
reformatting.

DD120 - Bounds violations occurred when a routine tried to access
an area outside its allocated core environment.

3.1.1.5  Operating System/System Support Software Errors

These categories list errors discovered in the operating system (OS)
software, the compiler, the assembler, and the system support or
specialized utility software. The OS and system support software for
Projects 2 and 3 were the same and virtually error free. No OS problems
were encountered in the Project 4 data.

3.1.1.6  Configuration Errors

Configuration errors were the catastrophic problems encountered
when the software, after undergoing some sort of modification (usually
to fix a problem), failed to be compatible with the operating system or
remaining applications software. These errors, for the most part,
were the product of a tight schedule and a failure to adhere to rigorous
configuration management techniques. That is, in the real world of
delivering software under pressure, errors which should have been caught
by existing CM procedures were not noted, usually resulting in software
that wouldn't load.

Also in this group is the unexplainable program halt (FF030),
which is not covered by the paragraph above. This category was assigned
to non-repeatable halts. Although possibly a hardware problem, documentation
did not identify the problem source, and it should be noted that none
of these produced alteration of the software.




3.1.1.7  Interface Errors


Interface  errors  have  been  grouped  into  five  major  headings. 

• Routine/routine interface errors - This group contains
error categories at the interfaces between applications
software routines.

• Routine/system software interface errors - These are errors
resulting from the interface between an applications
routine and an operating system or system utility routine.

• Tape processing interface errors - These are errors
in handling magnetic tapes.

• User interface errors - This group contains errors at the
user interface, including the machine operator, manual or
data card inputs, tape inputs, etc. In this group the
term "input data" is used to include all sources at the
user level.

• Data base interface errors - This group contains categories
which describe incompatibilities between the data base
structure and a setting or using routine(s).

Categories  in  these  groups  are  generally  self-explanatory. 

3.1.1.8  User Requested Changes

This group of categories was established because Project 4 was
operational, and the problem report was used to
request changes and enhancements of capability based on usage of the
delivered software.

3.1.1.9 Preset Data Base Errors

These  categories  were  assigned  to  SPRs  written  directly  against 
preset  or  constant  data*  in  the  data  base. 

MM010 - Data and operation request card formats were described
in  the  data  base  for  Projects  2 and  3.  Generalized  free 
and  fixed  field  card  processing  utility  routines  used 
these  descriptions. 

MM020  - Information  and  error  message  text  was  specified  in  the 
data  base  for  a generalized  message  processor  for 
Projects  2 and  3. 

*NOTE: Preset data is generated by the user and remains constant or
serves as an initial setting to be updated or modified by a routine.

3.1.1.10 Global Variable/COMPOOL Definition Errors


This group includes categories defining errors in the specification
of global variables or constants, i.e., data defined for use by interfacing
routines. In the environment of Projects 2 and 3 these amount to COMPOOL
definition errors. These variables are to be distinguished from internal
or local variables used only within a single routine.

3.1.1.11  Recurrent  Errors 

Establishment  of  this  group  was  an  attempt  to  assess  the  number  of 
problems  that  were  reopened  (the  fix  did  not  work  when  retested  in  the 
master  configuration)  and  duplicates  of  previous  SPRs. 

3.1.1.12  Documentation  Errors 

This group is possible because the SPR was used to flag documentation
problems. In the environment of Projects 2 and 3 the approved (code to) design
document is part of the deliverable product and is maintained rigorously.
Later, the document is updated to reflect validation and acceptance
test results and represents the "as-built" design. This concept of the
documentation being the design is an important one and one which allows
us to quantitatively determine the source of an error, as we shall see
in Section 4.2.3.

3.1.1.13  Requirements  Compliance  Errors 

These categories resulted directly from the fact that the software
failed to provide a capability specified by the requirements document.
This does not, however, say that the design overlooked the requirement.
A thorough review of the design against the requirements was carried
out for Projects 2 and 3 prior to the initiation of coding. Therefore,
these categories point to the fact that supposed "full capability" code
was not in compliance with the requirements at the time the SPR was
written*.

It should be noted here that no SPRs were written against the software
requirements;  these  were  baselined  and  not  a subject  for  change  in  the 
formal  test  phases  for  Projects  2 and  3. 

3.1.1.14 Unidentified Errors

This group recognizes that not all problem reports and closures
supply sufficient information for analysis. SPRs in this group and their
closures provided so little information that no error category could be
assigned. (From Table 4-1 it can be seen that occurrence of these ran as
high as 6.9 percent for one modification of Project 2 and 4.0 percent for
Project 3. Project 5 had no occurrences due to the real-time nature of
the data collection process.)


*Note: By the end of acceptance testing, which is requirements oriented,
all requirements were satisfied by the software.

3.1.1.15 Operator Errors


Operator  errors  were  assigned  when  the  problem  was  due  to  machine 
operator,  developer,  or  tester  error.  An  attempt  was  made  to  separate 
these  errors  from  those  where  the  operator's  error  was  due  to  a problem 
with the user interface, i.e., an error in the user interface design.

3.1.1.16  Questions 

The final group was generated to accommodate the fact that the SPR
was used as a vehicle for asking questions and guaranteeing an answer.

3.2 Observations on Category Generation

The approach recommended in generating error category lists is
basically the one suggested in the June 1973 report by MITRE entitled
"A Software Error Classification Methodology."* Attempts to generate
these lists should be open-ended in order to encompass specific
characteristics of the data being analyzed. However, the principal problem
encountered in such an approach is the tendency to create a category for
each error analyzed. This one-for-one phenomenon also occurs when
categories are generated through speculative techniques. The obvious
result is very few occurrences per category, and any attempt to relate
category frequency of occurrence to software characteristics proves
inconclusive.

Once the open-ended list is established, there is a tendency to
condense the list by combining several detailed categories into a single more
general category. The danger here, of course, is in changing the nature
of the error category to the point where some of the categories making up
the new category are no longer accurately identified by that new category.

As  the  industry  gains  experience  in  first  recognizing  the  significant 
characteristics  of  various  types  of  software  and  then  generating  category 
lists,  definition  techniques  will  be  refined  and  the  more  universal  lists 
will  surface. 

3.3  Recommendations  on  Category  Generation 

To anyone preparing to generate an error category list the
recommendation is to plan to do the job at least twice. The first attempt should
result in a fairly detailed list; any consolidation of categories should
be accompanied by a second pass through the data to evaluate the new
categories. This is an iterative business.
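
The sparse-category problem and the recommended second pass can be sketched in a few lines. This is only an illustration, not the procedure used in the study: the sample assignments, the codes AA081, AA082, and CC021, the `min_count` threshold, and the `merge_map` are all hypothetical.

```python
from collections import Counter


def sparse_categories(assignments, min_count=2):
    """Return the categories that occur fewer than min_count times."""
    counts = Counter(assignments)
    return sorted(cat for cat, n in counts.items() if n < min_count)


def consolidate(assignments, merge_map):
    """Second pass: re-label each report using the consolidated categories."""
    return [merge_map.get(cat, cat) for cat in assignments]


# Hypothetical first-pass assignments, one category per problem report.
first_pass = ["AA080", "AA081", "AA082", "CC020", "CC020", "CC021"]

# Hypothetical consolidation of detailed categories into more general ones.
merge_map = {"AA081": "AA080", "AA082": "AA080", "CC021": "CC020"}

second_pass = consolidate(first_pass, merge_map)

print(sparse_categories(first_pass))   # categories too sparse to analyze
print(sparse_categories(second_pass))  # sparse categories left after merging
```

The counts only show whether a merged category is populated. Whether it still describes every error assigned to it has to be judged by rereading the reports.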


*Reference 3.

The reader will note that error categories (Table 3-1) fall into
two groups: symptomatic categories and categories which identify the cause
of the error. For example, I/O error CC020 (Output missing data entries)
is a symptom which could have been caused by improper definition of a loop
variable, a table initialization error, etc. Computational error AA080
(Sign convention error) points to the cause of the error. Whether the cause
is evident depends on the information content of the problem closure report.
Symptoms are generally well documented on the problem report. An accurate
statement of cause was not guaranteed in the available data. This
symptom/cause situation prompted the general description of an error
category offered earlier.

The experience of generating error categories leads the analysts to
conclude that two lists are actually possible, one for the symptom and one
for the specific cause of that symptom at the code level. We can speculate
on the eventual application of such lists. A purely symptomatic list could
be useful to test groups as an aid in knowing what symptoms to look for.
Code level errors (causes) could aid in generation of automated development
and test tools, improvement of languages, etc.

The observation that symptomatic and causative lists are possible
and the suggestion that there may be a predictable relation between the two
are subjects addressed in Section 4.4.

3.4  Assignment  of  Error  Categories 

Assignment of error categories from data in the archives is something
of an art, even when a problem is well documented. In this study the
approach was to read the problem report and the closure report, reconstructing
the error from available information and assigning an error category. In
this after-the-fact analysis it was obvious that a familiarity with the
project and the conditions under which problem reports were generated is
essential. Categorization of the 7910 problem reports used in this study
was done by analysts who performed on the subject projects.

It is important to note that all problem reports were assigned to a
category, regardless of whether the problem was a real one or not. If not
a real problem, i.e., an exploratory closure, the SPR symptoms were
categorized. The thinking behind this approach was 1) to gain insight into the
analyst's understanding of the situation at the time he documented the
problem (some categories suffered a higher percentage of "no problem"
responses than other categories), and 2) to recognize the fact that all
problem reports, whether actual problems or not, required effort on the
part of a programmer to effect a problem closure.

Observations on Category Assignment


The following is apparent from the work done in this study:

• The fixer of the problem should be the source of the code
level error category assignment. He alone is close enough
to the problem to define it in non-symptomatic terms.

• Error categorization should be done at the time the problem
is closed, not after the fact.

• Assignment of categories should be based on the problem
report and the closure report.

• Long category lists with fine granularity are hard to use
and require practice to use effectively.

• A change in data collection philosophy is necessary to
guarantee collection of reliability-related data at the
time it is available while satisfying configuration
management data requirements. Not only are changes in the
collection forms necessary, but the authors of the various forms
need to be fully aware of the objectives of the data
collection effort.

3.5  Development  of  Project  5 Error  Categories 


With the experience of creating error categories for Projects 2, 3,
and 4 behind us we set about creating a list to be used on Project 5. Our
goals, in order of importance, were to create a list that:

• Fixers of errors would use,

• Would be causative in nature,

• Would provide information on each error's origin,

• Would be similar enough to existing lists at the major
category level to allow comparison of results.

Since the list had to recognize project peculiarities, assistance
in refining the categories was sought from Project 5 performers. Early
attempts at generating such a list produced a very long list, containing
435 detailed categories, which was eventually discarded as being
impractical and counterproductive in a real world software development
environment.* What error data were available at that time were also used to
generate the list.


*Comments from the potential users indicated that the list's length was
antagonizing, too.


The resultant list is presented in Table 3-2. Note that there is
a blank character in the five character, alpha-numeric error category
designator. This blank is for the error's source. Once the fix was made
and tested, the responsible programmer was asked to select an error
category that most accurately described the cause of the problem. As in the
previous list, indenture denoted a level of categorization detail, and
programmers were instructed to assign categories only to the level of
detail they felt was accurate. The remaining information needed was the
error's source, which is defined in the following:

Source  Problem
ID      Source        Description

0       Design        The source of the problem was in
                      the preliminary or detailed design.

1       Coding        The source of the problem was an
                      error made in implementing the
                      design as code.

2       Requirements  The source of the problem was a
                      changing, ill conceived, or poorly
                      stated requirement.

3       Maintenance   The source of the problem was an
                      error introduced in the process of
                      trying to fix a previous error.

4       Not Known     Source of error not known.

As an example of categorization with the Project 5 list, A2300
would be a sign convention computational error traceable to an origin in
the software requirements.
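
The designator layout described above (major category letter, source ID in the blank position, three-digit detail code) can be decoded mechanically. The sketch below is illustrative only: the function name, the regular expression, and the choice to accept `_` or a space for an unfilled source position are assumptions, not part of the Project 5 procedure. The X000n rejection codes have no source position and are deliberately not handled.

```python
import re
from typing import NamedTuple, Optional

# Source IDs as defined in the source table above.
SOURCES = {
    "0": "Design",
    "1": "Coding",
    "2": "Requirements",
    "3": "Maintenance",
    "4": "Not Known",
}


class ErrorCategory(NamedTuple):
    major: str             # major category letter, A through K
    source: Optional[str]  # decoded source, or None if the blank is unfilled
    detail: str            # three-digit detail code, e.g. "300"


# Letter, then source digit (or "_"/space when unfilled), then three digits.
_DESIGNATOR = re.compile(r"^([A-K])([0-4_ ])(\d{3})$")


def parse_designator(code: str) -> ErrorCategory:
    """Split a five character designator into major, source, and detail."""
    match = _DESIGNATOR.match(code)
    if match is None:
        raise ValueError(f"not a five character category designator: {code!r}")
    major, source_id, detail = match.groups()
    return ErrorCategory(major, SOURCES.get(source_id), detail)


print(parse_designator("A2300"))
```

For the example in the text, `parse_designator("A2300")` yields major category A, detail 300 (sign convention error), and source Requirements.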

Referring  back  to  Table  3-2,  note  that  there  still  remain  some 
symptomatic  categories.  A completely  causative  list  makes  for  many 
categories,  and  shortness  was  mandatory. 

Categories  themselves  are,  with  some  exceptions,  self-explanatory. 
Noteworthy  exceptions  are  as  follows: 

I_100 (Operating system error). This category was provided
because the real time operating system was an addition
to a vendor supplied basic operating system. This
category applied only to this basic operating system.

J_100, J_200 (Time and core limits exceeded). These categories
are particularly important to real time software
systems where there may be no error per se. The
problem in these cases is a potential one, that the
system may be too slow.

X0005 (Problem report deferred). Provision for this was
necessitated by the multiple build, top-down approach
being taken in the Project 5 environment. In many
cases authors of problem reports are able to foresee
problems with future iterations of the development
cycle. This category made it possible to identify
these and track their closure at a future date. This is an
important feature of a thorough configuration
management discipline of a top-down approach.

As with the preceding error category lists, not all Project 5
problem reports document errors, an error being something requiring corrective
action in the form of a change to the software (source code or data base)
or its documentation. Detailed analysis of the frequency of occurrence
of various types of errors is presented in Section 4.2.1 for each of the
source projects. In that section particular emphasis is placed on the
errors which produced a change in the code or in the data base. These
we termed code change errors or "actual" problems.

Table 3-2. Project 5 Error Categories


A_000  COMPUTATIONAL ERRORS

A_100  Incorrect operand in equation
A_200  Incorrect use of parenthesis
A_300  Sign convention error
A_400  Units or data conversion error
A_500  Computation produces an over/under flow
A_600  Incorrect/inaccurate equation used
A_700  Precision loss due to mixed mode
A_800  Missing computation
A_900  Rounding or truncation error

B_000  LOGIC ERRORS

B_100  Incorrect operand in logical expression
B_200  Logic activities out of sequence
B_300  Wrong variable being checked
B_400  Missing logic or condition tests
B_500  Too many/few statements in loop
B_600  Loop iterated incorrect number of times
       (including endless loop)
B_700  Duplicate logic

C_000  DATA INPUT ERRORS

C_100  Invalid input read from correct data file
C_200  Input read from incorrect data file
C_300  Incorrect input format
C_400  Incorrect format statement referenced
C_500  End of file encountered prematurely
C_600  End of file missing

D_000  DATA HANDLING ERRORS

D_050  Data file not rewound before reading
D_100  Data initialization not done
D_200  Data initialization done improperly
D_300  Variable used as a flag or index not set properly
D_400  Variable referred to by the wrong name
D_500  Bit manipulation done incorrectly
D_600  Incorrect variable type
D_700  Data packing/unpacking error
D_800  Sort error
D_900  Subscripting error

Table 3-2. Project 5 Error Categories (Continued)


E_000  DATA OUTPUT ERRORS

E_100  Data written on wrong file
E_200  Data written according to the wrong format statement
E_300  Data written in wrong format
E_400  Data written with wrong carriage control
E_500  Incomplete or missing output
E_600  Output field size too small
E_700  Line count or page eject problem
E_800  Output garbled or misleading

F_000  INTERFACE ERRORS

F_100  Wrong subroutine called
F_200  Call to subroutine not made or made in wrong place
F_300  Subroutine arguments not consistent in type, units,
       order, etc.
F_400  Subroutine called is nonexistent
F_500  Software/data base interface error
F_600  Software user interface error
F_700  Software/software interface error

G_000  DATA DEFINITION ERRORS

G_100  Data not properly defined/dimensioned
G_200  Data referenced out of bounds
G_300  Data being referenced at incorrect location
G_400  Data pointers not incremented properly

H_000  DATA BASE ERRORS

H_100  Data not initialized in data base
H_200  Data initialized to incorrect value
H_300  Data units are incorrect

I_000  OPERATION ERRORS

I_100  Operating system error (vendor supplied)
I_200  Hardware error
I_300  Operator error
I_400  Test execution error
I_500  User misunderstanding/error
I_600  Configuration control error

Table 3-2. Project 5 Error Categories (Continued)


J_000  OTHER

J_100  Time limit exceeded
J_200  Core storage limit exceeded
J_300  Output line limit exceeded
J_400  Compilation error
J_500  Code or design inefficient/not necessary
J_600  User/programmer requested enhancement
J_700  Design nonresponsive to requirements
J_800  Code delivery or redelivery
J_900  Software not compatible with project standards

K_000  DOCUMENTATION ERRORS

K_100  User manual
K_200  Interface specification
K_300  Design specification
K_400  Requirements specification
K_500  Test documentation

X0000  PROBLEM REPORT REJECTION

X0001  No problem
X0002  Void/withdrawn
X0003  Out of scope - not part of approved design
X0004  Duplicates another problem report
X0005  Deferred



3.6  History  of  Error  Category  Lists 


It is instructive to summarize our various attempts to create error
category lists because it helps to put some of our problems, at least those
related to list length, in perspective. Such a summary is presented in
Table 3-3 below. These range from early (circa 1971) work done in support
of CCIP-85 to current work done on this study.


Table 3-3. History of Error Category Lists

Major       Detailed
Categories  Categories  Iteration                  Comments

13          0           Study done in support      • Entirely symptomatic
                        of CCIP-85

13          224         Post-CCIP-85 in-house      • Greater emphasis on cause
                        work                       • Failed to recognize code types

20          164         Interim technical report   • Predominately symptomatic
                        for Software Reliability   • Recognized types of code
                        Study (Table 3-1)          • Assignment not by problem fixer

25          435         Early causative work       • Generated by speculation
                                                   • Long with redundancies
                                                   • Hard to use

12          79          Final causative list       • Comprehensive but short
                        (Table 3-2)                • Easy to use
                                                   • Problem fixer assigns categories

Note that the length of these lists has progressed from the very short,
when software problems were not fully understood, to the very long, when the
interest in detail overshadowed all other considerations, back to a
relatively short list, which is comprehensive and informative but recognizes
the need for ease of use in any attempt to collect accurate error data.

The question of universality* of these lists is one for which we
have only limited answers. The list used for Project 5 is, in many respects,
similar to the one used for Project 3. This was by design. The major
categories, e.g., logical, computational, interface errors, etc., appear to be
quite universal in their applicability. The detailed categories, however,
are less universal and suffer in applicability due to differences in
language, development philosophy, software type, etc. The time at which data
are collected may also have a bearing on applicability to some software test
environments. For Project 5 the list used was apparently adequate for the
real time applications and simulator software, as well as the Product
Assurance tools. However, there was criticism concerning applicability of
detailed categories to the real time operating system software** problems.
No attempt was made to identify specific deficiencies because, even with the
instructions to categorize only to an accurate level, i.e., stick to the
major categories, the detailed categories were used, and checks of
assignments showed them to be appropriate.


*Applicability to more than one project.

**This software is written in assembly language; all other software is
FORTRAN.


4.0 ANALYSIS OF ERROR DATA


This section contains the bulk of analysis of data collected from the
four source projects. The principal object of this analysis has been
software error data collected during testing and operational usage. However,
where other information has been available it too has been used to aid in
the analysis of error data.

In Section 7.0, Data Collection, we point out that software projects
have a potential for creating a tremendous amount of data.* Our
experience, when confronted with data from four projects, was much like the first
trip to a candy store, i.e., where do we start and how much do we consume?
We quickly realized that to deal with all the data would be, in a practical
sense, impossible. It also became obvious that data collected after the
fact leaves much to question in terms of accuracy and completeness, as well
as availability on other projects. A third early lesson was that
promising results on one project may not be borne out by similar investigations
on other projects. Therefore, a decision was made to work principally with
projects for which data were most current and complete, and to follow
avenues of investigation believed to be most productive in terms of
useful results. The result has been that, although findings from all four
source projects are presented, Projects 3 and 5 are most thoroughly
represented by this section, and by the remainder of this report. Project 3
provided the most complete (and final) data set and Project 5, an on-going
project, provided an opportunity to tailor data collection activities to
study needs.

4.1 Approaches to Analysis and Background Information

Two basic approaches were taken in analyzing the error data. The first
was to examine the raw problem reports or empirical data to learn as much
as possible about the types of errors, how and when they were detected, and
to identify trends. Results are presented in Section 4.2. Specific
questions asked during the course of analysis were as follows:

• What was the error type or category?

• When was the error detected?

• Should the error have been detected earlier?

• When was the error introduced?

• How critical was the error to successful software operation?

• How long did it take to fix the error?


*Project 3 alone produced 5619 design problem reports; 4521 software
problem reports; cost, schedule, manpower, and personnel data for 76
programmers; and software structural information for 115,346 source statements
forming 249 routines.

As part of the error categorization process, it was noted in
Section 3.4 that error categories can be causative or symptomatic, depending
on the viewpoint of the test analyst. That is, the finder and fixer of an
error often view the error differently. The finder sees symptoms, while
the fixer thinks in terms of cause. A brief analysis of the relation
between symptoms and causes is presented in Section 4.4.

The second approach to analysis was termed the phenomenological
approach. Statistical in nature, this approach investigates the relation
of software attributes and metrics* to error histories. A detailed
description of this approach and results of analysis are presented in Section 4.3.

Finally, in conjunction with this examination of error data, a number
of ancillary investigations using other available empirical data were
performed. Results of these investigations are presented in Section 4.5.

Before proceeding, one piece of information concerning software
structure is germane to the understanding of results presented throughout this
section. It is also useful information to anyone planning to analyze
empirical data on other projects. It will be noted that the unit of
software modularization varies depending on the particular investigation being
pursued. In some instances for Project 3, results are presented at the
routine level, i.e., one data point per routine. In other instances results
are presented at the function or subsystem levels, where routines are
grouped to perform a single function or a number of related functions,
respectively. This is because not all trends were observable at the
routine level. Some trends that were obscure at the routine level became
well correlated when routines were grouped functionally (e.g., data base
management function) or according to purpose (e.g., computational routines).
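
Moving from one data point per routine to one data point per function amounts to a simple roll-up. The sketch below is a hedged illustration and is not drawn from the study: the record layout, the sample numbers, and the errors-per-thousand-statements measure are all assumptions.

```python
from collections import defaultdict


def errors_per_thousand_by_function(routines):
    """Roll routine-level records up to one error rate per function.

    Each record is (function name, source statements, error count).
    """
    statements = defaultdict(int)
    errors = defaultdict(int)
    for function, n_statements, n_errors in routines:
        statements[function] += n_statements
        errors[function] += n_errors
    return {
        function: 1000.0 * errors[function] / statements[function]
        for function in statements
        if statements[function] > 0
    }


# Hypothetical routine-level data: (function, statements, errors).
routines = [
    ("data base management", 400, 6),
    ("data base management", 600, 4),
    ("computation", 500, 5),
]

rates = errors_per_thousand_by_function(routines)
print(rates)
```

Grouping by purpose rather than by function would use the same roll-up with a different grouping key.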

That groupings by purpose produced well correlated results (in some cases)
is not particularly surprising.** Results from functional groupings are not
so surprising either, once one realizes that routines in the same function
tended to have the same schedule, manpower, and implementation problems.
In fact, Project 3 management was set up in conjunction with the software
structure so that personnel assigned to one work unit, ranging in size
from 5 to 15 programmers, produced all the software in one or more of the
functions. Also, software requirements allocated to a function were all
of equivalent detail and represented similar problems to the function
developers. The Project 3 structure is presented in Figure 4-0.

Although the other projects had functionally oriented architectural
structures,*** Project 3 is the only one with results presented in this form.
Project 4 results are presented for the project as a whole, and Project 5
results are presented according to applications, simulator, operating system,
and software tools. Since each of these Project 5 groupings is so large,
both in software size and in development organization size, each is
considered as a separate project.

*Attributes identify specific software characteristics, such as size,
complexity, and readability, and metrics quantify these attributes.

**See Section 4.3.

***Project 2's was nearly identical, with functions and subsystems
performing very similar generic functions.

Figure 4-0. Project 3 Software System Structure


4.2 Analysis of Empirical Data

This section contains results of our analysis of software errors, how
many there were, what type they were, when they were found, and their
origin in the development cycle. Where necessary (and possible) supporting
information is drawn into the analysis to explain trends.

4.2.1 Error Types and Frequency of Occurrence


Using the error category lists described in Section 3.0, we attempted
to answer the questions "what were the types of errors?" and "how many of
each type were there?" Our analysis of the number of occurrences of each
type took place on two levels: first, at the major category level, and
second, at the detailed category level. As was explained earlier,
particular emphasis was placed on "actual" problems, the errors which
required an alteration to the code or data base to effect corrective action,
although all documented problems were tracked and are represented at the
major category level of analysis.

Data  from  all  four  source  projects  are  presented.  As  a refresher, 
the  following  table  summarizes  project  characteristics  again.  Note  that 
the  Project  5 software  is  broken  down  into  its  component  portions;  each 
of  these  portions  (there  are  others  on  this  large  project)  is  essentially 
a project  all  by  itself  because  of  differences  in  language,  development 
personnel,  software  requirements,  etc. 


           Software Type        Operating Mode       Language   Development Approach

Project 2  Command and Control  Batch                JOVIAL J4  Single Increment*

Project 3  Command and Control  Batch                JOVIAL J4  Single Increment*

Project 4  Data Management      Time Critical Batch             Operational

Project 5  Applications S/W     Real-Time            FORTRAN    Top-Down Multiple Increment

           Simulator S/W        Real-Time            FORTRAN    Top-Down Multiple Increment

           Operating System     Real-Time            Assembly   Top-Down Multiple Increment

           PA Tools             Batch                FORTRAN    Single Increment*


*"Single increment" refers to a typical development cycle where each
development phase is performed only once. This is in contrast to the top-down,
multiple increment approach where the cycle is repeated several times, first
for a system of stubs and subsequently for replacement of stubs with
deliverable software.

It should also be restated that available error data for the four
source projects do not come from the same type of testing. However, the
levels of testing (subsystem and system) are similar. That is, the
software under test or operation* was being exercised as a system or major
subsystem. Therefore, even though Project 5 data come from integration
testing, the applications software, the operating system, and the simulator are
operating as a system with a near-operational** data base.

Occurrences  of  Errors  for  Projects  2,  3,  and  4 

Figure 4-1 presents a percentage breakdown by major category for the
four modifications to Project 2***, the initial development of Project 3,
and the results of operational errors for Project 4. Raw data are presented
in Section 3 along with the description of Project 3 detailed error
categories. Table 3-1 presents raw data at the major category level.


Percentage breakdowns are what might be expected, given the types of
software and the test environments. For example, Project 4 had virtually
no computational errors (<1.3%) because there is very little computational
code in this data management system, the bulk of the code being logical or
data handling in nature. Percentages for these categories reflect this
fact at 26.0 and 20.4 percent, respectively.
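
The percentages plotted in Figure 4-1 are each major category's share of a
project's documented problems. The following Python sketch shows that
calculation under stated assumptions: the function name and the counts are
hypothetical and chosen only for illustration; they are not data from any
of the source projects.

```python
def percentage_breakdown(counts):
    """Return each category's share of the total, in percent (one decimal)."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no problem reports to break down")
    return {category: round(100.0 * n / total, 1) for category, n in counts.items()}


# Hypothetical counts, for illustration only (not data from the study).
example_counts = {"computational": 2, "logic": 52, "data handling": 41, "other": 105}
example_percentages = percentage_breakdown(example_counts)
```

With these made-up counts, a system with little computational code shows a
correspondingly small computational share, which is the pattern described
above for Project 4.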

Similarities between the various updates of Project 2 and Project 3
show up in Figure 4-1 also. For these systems the percentage mix of errors
is roughly the same for a given major category. Variations which do exist,
e.g., the differences in the percentage of data handling errors between
modifications MOD1B and MOD1BR of Project 2, can be explained by a close
look at the type of code being added in the modification or created in the
initial development. That is, if data handling code is the largest
component of the change, the percentage of data handling errors can be
expected to be large compared to percentages for other types of code.† In
this example MOD1BR, a single-purpose new capability, was predominantly an
addition of logical and data handling code. It was a small update with only
one new routine added; therefore, there was only one new routine/routine
interface. However, there was an added demand placed on the user at the
user interface. The effects of each of these may be seen in Figure 4-1
by a comparison to other updates to Project 2. MOD1B, on the other hand,
was a fairly substantial update with additions and changes representing
virtually all code types and significant interface alterations, resulting
in a spreading of error occurrences to more categories.


*Project 4.

**To the extent possible at the current point in the development cycle.

***MOD1A, MOD1B, MOD1BR, and MOD2.

†This relationship is quantitatively explored in Section 4.3.

Figure 4-1. Percentage Occurrence of Error Category Groups

Figure 4-3. Project 5 Integration Test Results for the Real-Time Simulator Software


Other  points  of  interest  in  Figure  4-1  are  listed  below. 

• Very few operating system and system support errors were
documented for any of the source projects. With the
exception of update MOD1BR for Project 2, only the "tried
and true" operating system and system support capabilities
were used. MOD1BR utilized OS software service capabilities
not commonly used, and use of these services uncovered
some errors. The operating system in each of these
software system environments could be considered error-free.

• Although percentages for interface errors varied from
project to project, the total of all interface errors* was
nearly constant for all three projects, ranging from a
low of 11.6 percent to a high of 15.7 percent of the total
errors. No specific reason can be given for this
phenomenon, however.

• Configuration errors, where the configuration control
procedures broke down under the pressure of implementing fixes
to errors, were a fairly constant percentage for each
project in the Project 2 and Project 3 environment. A low
of 0.6 percent and a high of 1.9 percent were observed.
Note that in the lower pressure Project 4 operational
maintenance environment no such errors were encountered.
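
The interface total quoted in the list above is the sum of the
routine/routine, routine/system software, tape processing, data base, and
user interface categories, expressed as a percentage of all errors. A
minimal Python sketch of that roll-up follows; the category keys, the
function name, and the counts are assumptions made for illustration, not
study data.

```python
# Interface sub-categories that are rolled up into one total (key names assumed).
INTERFACE_CATEGORIES = (
    "routine/routine",
    "routine/system software",
    "tape processing",
    "data base",
    "user",
)


def total_interface_percentage(counts):
    """Percentage of all errors that fall in any interface sub-category."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no errors recorded")
    interface = sum(counts.get(name, 0) for name in INTERFACE_CATEGORIES)
    return round(100.0 * interface / total, 1)


# Hypothetical counts, for illustration only.
example_counts = {
    "logic": 60,
    "data handling": 40,
    "routine/routine": 6,
    "routine/system software": 2,
    "tape processing": 1,
    "data base": 3,
    "user": 3,
    "other": 10,
}
interface_share = total_interface_percentage(example_counts)
```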

A more detailed examination of these data is presented in subsequent
paragraphs of Section 4.2.


Occurrences of Errors for Project 5 Applications Software


Figures 4-2 and 4-3 present the percentage breakdown by major category
for the Project 5 real-time applications software and simulator software,
respectively. These histograms present four entries for each major
category, one for each "increment" in the top-down approach. Increments are
identified as 0, 1, 2, and 3, where increment 0 was a system made up of
stubs or dummies and subsequent increments have added needed system
capability by successively replacing stubs and dummy software with real
code. Data for the real-time operating system and product assurance tools
are not presented here because the data cannot be considered complete.

*Composed of routine/routine, routine/system software, tape processing,
data base, and user interface errors.

Our principal purpose in presenting the data by increment was to show
any variations in the mix of errors over time as a result of the top-down
approach. In Figure 4-2, total problems for the applications software,
there appear to be some increment dependent variations. Increment 0, the
dummy processor, was composed of very little computational code, and data
output code was of the debug variety, i.e., to be replaced with deliverable
code. It is not surprising that there were no errors in these two major
categories during Increment 0. Data input code was part of the Increment 0
software, and the data show a corresponding presence of data input errors
in Increment 0.

The trends for some categories in Figure 4-2 appear to be tapering off
while others are rising. For example, in Increments 1 and 2 most of the
computational code was added to the applications software. Most of the data
handling code was created in the early increments also. As a result,
percentages for both of these categories have dropped in the later
increments, presumably because the earlier code has been tested and errors
removed, while lesser amounts of code in these categories are being added
in the later increments. Interfaces have increased with each increment and
there has been a corresponding rise in the number of interface errors,
although there appears to be a decline in the percentage of interface
errors for Increment 3*, which is consistent with the fact that fewer
interfaces were added than in Increment 2. Logical and data definition
error percentages appear to be remaining relatively constant, regardless of
increment.

Finally, one very positive observation that can be made in Figure 4-2
is the relatively large and increasing percentage of data base value
"errors". Use of an operational data base during integration testing is
being encouraged partly to create and tune an operational data base in
order to avoid the risk of waiting until just prior to formal system
testing to attempt creating such a data base. In the main, these are not
necessarily errors but alterations in data values made necessary by tuning
for best operational performance. As the top-down approach progresses
through the increments, the data base category takes on an increasing
percentage of errors, even though, as is indicated in Figure 4-2, the
number of errors per increment is essentially constant.
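
A category's percentage can rise while its count stays flat if the counts
in other categories fall. The Python sketch below illustrates this
arithmetic under that assumed reading of Figure 4-2; the function name and
the counts are hypothetical and are not the Project 5 data.

```python
def category_share_by_increment(category_counts, total_counts):
    """Percent share of one category in each increment (one decimal)."""
    if len(category_counts) != len(total_counts):
        raise ValueError("one total is required per increment")
    return [
        round(100.0 * count / total, 1)
        for count, total in zip(category_counts, total_counts)
    ]


# Hypothetical counts for Increments 0 through 3, for illustration only.
data_base_errors = [10, 10, 10, 10]   # constant count per increment
all_errors = [100, 80, 50, 40]        # total problems falling over time
shares = category_share_by_increment(data_base_errors, all_errors)
```

Here the count never changes, yet the share grows from increment to
increment because the totals shrink.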

Occurrences  of  Errors  for  Project  5 Simulator  Software 

Variations in the percentage mix of errors for the four increments of
the simulator software are not obvious, although some similarities to the
applications software results may be seen in Figure 4-3. For instance,
there is a buildup and eventual tailoff in the percentage of computational
and data output errors. The trend of data input errors is also similar to
that experienced for the applications software. Logical and data handling
errors, however, differ in their variation over a period of four increments,
possibly because of the early introduction of computational code in the
simulator software. Note that the percentage of data base errors, as for
the applications software, is significant, again demonstrating the
importance of preparing an operational data base early in the development
cycle.


*Available data show that the percentage increase in the number of
interface errors is not proportional to the number of new interfaces being
added by each increment. We hypothesized that the smaller routines made
necessary by Project 5 size standards would result in a proportionally
greater number of interface errors. However, these preliminary data
indicate no penalty in terms of interface errors and that our hypothesis
was not correct.

Our finding that the occurrences of problem reports are related to
the predominant type of code is not a surprising one, and is a
relationship that we will pursue in greater detail in Section 4.3. In the
next paragraph we will look in greater detail at the code and data base
change errors in major categories for each of the source projects, and then
we will look at the occurrences of errors in the most prevalent detailed
categories.

Comparison of Projects' Major Error Categories

Table 4-2 presents a percentage breakdown of errors by major category
for errors which resulted in a code change. In this table data from
Projects 3 and 4 have been regrouped to be compatible with the more
accurately collected Project 5 data*. Note that for each of the subject
projects code change errors tended to fall primarily into the following
four major categories: logic, data handling, data I/O, and interfaces. If
data input and output categories are lumped together, as is the case with
Project 3, and the category called "Other Errors" is not considered, the
order of importance based on relative percentage magnitude is surprisingly
similar in some respects and understandably different in others (Table 4-3).
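
The ordering step described above (lump data input and output together,
set "Other Errors" aside, then sort by percentage) can be sketched in
Python as follows. This is an illustrative sketch only: the function name,
the category keys, and the percentages are hypothetical and do not
reproduce Table 4-2 or Table 4-3.

```python
def rank_categories(percentages,
                    merge=("data input", "data output"),
                    merged_name="data I/O",
                    exclude=("other",)):
    """Order major categories by percentage, largest first.

    Categories named in `merge` are summed into `merged_name`;
    categories named in `exclude` are left out of the ranking.
    """
    combined = {
        name: value
        for name, value in percentages.items()
        if name not in merge and name not in exclude
    }
    merged_total = sum(percentages.get(name, 0.0) for name in merge)
    if merged_total:
        combined[merged_name] = merged_total
    return sorted(combined, key=combined.get, reverse=True)


# Hypothetical percentages, for illustration only.
example = {
    "logic": 30.0,
    "data handling": 20.0,
    "data input": 8.0,
    "data output": 7.0,
    "interface": 12.0,
    "computational": 9.0,
    "other": 14.0,
}
order_of_importance = rank_categories(example)
```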

It is significant that logical errors ranked high for each project,
regardless of software type, language, operating mode, or other project
differences. It is also significant that data handling errors ranked a
close second for all projects except the highly analytical applications
and simulator portions of Project 5, which experienced computational errors
in place of the data handling errors. As for the computational category
on the other projects, its relatively low position in the order of
precedence was due to a thorough understanding of algorithms in Project 3
and the fact that computational code did not form a very large part of the
software on Project 4.

In  Table  4-4  the  percentages  are  given  for  each  major  category  as  a 
function  of  increment  to  show  changes  due  to  the  top-down  approach. 

Occurrences  of  Detailed  Errors 


Having examined the major categories and found some similarities
between projects, the next step was to examine the detailed error
categories to determine the predominant detailed categories within the
major categories. Again this investigation centered only on those
documented problems which caused a change to the code, i.e., the actual
errors.


*Project 2 is not considered here because data come from successive updates
to an existing system rather than from an initial development. Even so, as
we have seen, results are much like Project 3 results.

Percentage Breakdown of Code Change Errors Into Major Error Categories

Direct comparison of detailed error categories between Project 3 and
Project 5 is made difficult by the fact that the error category list used
for Project 5 differs from that used in analyzing data from other projects.*
Therefore, results are presented separately, but to make the comparison
easier, detailed categories for Projects 3 and 4 have been grouped where
similarities warrant such grouping. These results appear in Table 4-5,
where entries are percentages of the major category. Table 4-6 presents
a similar percentage breakdown of Project 5 major categories for each of
the four portions of the Project 5 software being studied. It will be
noted in Table 4-6 that not all detailed categories are represented. In
this table only those categories representing a significant portion
(greater than 5 percent) of the major category for at least one portion of
the Project 5 software are presented.**
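
The selection rule for Table 4-6 (keep a detailed category only if it
exceeds 5 percent of its major category in at least one portion of the
software) can be sketched in Python as shown below. The function name, the
category labels, the portion names, and the percentages are hypothetical
and serve only to illustrate the rule.

```python
def significant_categories(breakdown, threshold=5.0):
    """Keep detailed categories exceeding `threshold` percent in any portion.

    `breakdown` maps a detailed category name to a dict of
    {software portion: percent of the major category}.
    """
    return [
        category
        for category, by_portion in breakdown.items()
        if any(percent > threshold for percent in by_portion.values())
    ]


# Hypothetical percentages, for illustration only.
example_breakdown = {
    "category A": {"applications": 50.0, "simulator": 40.0},
    "category B": {"applications": 1.0, "simulator": 4.0},
    "category C": {"applications": 3.0, "simulator": 9.0},
}
kept = significant_categories(example_breakdown)
```

Category B is dropped because it stays at or below the threshold in every
portion, while category C is kept on the strength of a single portion.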

For both Projects 3 and 4 the highest occurrences of computational
errors were in computing entry numbers, indices, and flag settings.
Percentages ran 26.3 and 57.1 percent, respectively. None of the portions of
Project 5 experienced this type of error, possibly due to the detailed
routine level testing done before integration testing. Sign convention
errors ranged from 2.7 percent to 9.1 percent in the Project 5 data and
were at 5.0 percent for Project 3. Project 4 had no occurrences. Units
conversion error percentages were surprisingly similar at 8.1, 8.3, and
9.3 percent for the analytical Projects 3 and 5 and 14.3 percent for
Project 4. The highest percentages of computational errors, however,
fell in the incorrect or inaccurate equation categories.

Logical errors, the major category with the most occurrences for all
projects, fell predominantly in the missing logic or condition test detailed
category for all projects. Percentages ranged from 39.6 percent for the
Project 5 simulator software to 76.6 percent for the Project 5 Product
Assurance tools. Of all the errors encountered this one represents the
most significant found in the whole study. Incorrect logic (logical results
were wrong) or incorrect logical sequence activities were next most
important for all projects. Combined, these ranged from 14.9 to 45.0
percent of the logical category. Surprisingly, errors involving loops and
endless loops were not very common, representing only 5.2 percent of the
Project 3 logical errors and even less for the other projects.

For all projects the predominant input/output detailed error
categories had to do with incorrect formats and output that was either
garbled or didn't correspond with the design specification. Missing output
and output missing entries were also common in the I/O major category.


*As mentioned in Section 3, this was done 1) to simplify the
categorization process in order to make it acceptable to project
performers, and 2) to make it as causative as possible. The Project 5 error
categories did evolve from the Project 3 list, however, and common
categories do exist at both the major and detailed levels.

**This was done to cut down on the number of zero and near zero entries.

Data handling errors, next to logical errors in occurrences, were
predominantly errors in initializing and updating data. Occurrences in
these categories formed 65.7 and 77.3 percent of the major category for
Projects 3 and 4, respectively, and ranged from 26.3 percent for the
simulator software to 82.3 percent for the operating system software for
Project 5. Errors in initialization of flags and indices were a large
problem on all of the source projects.

Similarities between projects in the interface categories are harder
to find because of the differences between category lists.* However,
from Tables 4-4, 4-5, and 4-6, it can be seen that routine/routine calling
sequence (argument) errors and errors in data compatibility were most
common.

For Project 5 tape processing is not a very significant activity.
For Projects 3 and 4, however, tape handling was a significant task. Why
all Project 3 errors fell in one detailed category is a mystery.

Remaining  detailed  categories  exhibit  percentage  breakdowns  that 
aren't  so  comparable.  But  these  unique  data  are  useful  by  themselves 
along  with  other  error  percentages  in  defining  tools  and  techniques  to 
improve  the  development  and  test  processes  (see  Section  4.7). 


*One of the major simplifying concessions made on the Project 5 list was
in the area of interfaces.

Table 4-5. Detailed Error Category Breakdown for Projects 3 and 4

Table 4-6. Project 5 Detailed Error Category Breakdown

4.2.2 When Were Specific Errors Found?


All error data analyzed in this study were the result of formal testing
or, in the case of Projects 3 and 4, operational use. By formal testing we
mean that testing performed after the programmer finishes his debug and
checkout. Formal testing is documented through plans and procedures and is
designed to demonstrate some predefined test objectives, e.g., interface
integrity, requirement satisfaction, or use in the operational environment.

Software test personnel put forth the contention that they find
certain types of errors first (e.g., aborts, endless loops, major interface
errors, etc.), and once the software is cycling properly, attention is
turned to detailed examination of such things as output accuracy,
demonstration of requirements, and timing performance.

As part of the analysis of Project 3 data, an attempt was made to
determine 1) if specific types of errors are found at certain times within
each test phase and 2) if these types could be traced to a particular type
of test case or test strategy.

First of all, the contention that certain types of errors are found
first was not borne out by the data. Error categories appeared to be
distributed in time across each test phase in such a way that no trends
were noted. This was also true for the one test phase represented in the
Project 5 data.

A close examination of the text of Project 3 SPR's indicated numerous
instances where individual test analysts (by name) documented problems of
similar error type over a period of a day or two. That is, having detected
a certain type of error, e.g., an entry computation error based on a
misunderstanding of whether the first entry was the 0th or 1st entry, there
was a tendency for that analyst to go looking for the same type of error
elsewhere in the code. No attempt was made to quantify the extent to which
this phenomenon occurred; however, it is believed that this could have some
very positive effects on the rate and completeness of error discovery, in
spite of a diversion of the test analyst's attention away from a broader
search for all types of errors, and especially if the test analyst is
intimately familiar with all the code produced by an individual programmer.

To get a picture of the relative magnitude of documented errors as a
function of time, Figures 4-4 through 4-8 are presented. These histograms
are from Projects 2 and 3 and depict the buildup of problem reports partly
due to a "learning curve" on the part of the test personnel and partly due
to the serial nature of the software and test cases. That is, not all of
the software can be tested at once due to the need to develop data bases in
early test cases for input to subsequent test cases. The typical tailoff
of problem reports as the backlog of errors detectable by a specific test
strategy is found and corrected is also visible. Note that successive
test phases, each representing a different set of principal test objectives,
produce a new buildup, peak, and tailoff of problem reports over time.
Hopefully, each peak is lower than its predecessor, but as may be seen,
this was not always the case.
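
Histograms of this kind can be built by counting problem reports per time
interval within each test phase and then locating each phase's peak. The
Python sketch below assumes weekly intervals and uses hypothetical phase
names and week numbers; it does not reproduce the data behind Figures 4-4
through 4-8.

```python
from collections import Counter


def reports_per_week(report_weeks):
    """Count problem reports per week, including weeks with no reports."""
    if not report_weeks:
        return []
    counts = Counter(report_weeks)
    first_week, last_week = min(counts), max(counts)
    # A Counter returns 0 for weeks in which no report was written.
    return [(week, counts[week]) for week in range(first_week, last_week + 1)]


def peak_week(histogram):
    """Return the (week, count) pair with the largest count."""
    return max(histogram, key=lambda entry: entry[1])


# Hypothetical week numbers of problem reports, grouped by test phase.
phases = {
    "validation": [1, 2, 2, 3, 3, 3, 4, 4, 5],
    "acceptance": [6, 7, 7, 7, 7, 8, 8, 9],
}
histograms = {name: reports_per_week(weeks) for name, weeks in phases.items()}
peaks = {name: peak_week(hist) for name, hist in histograms.items()}
```

In this made-up example the second phase peaks higher than the first,
which mirrors the observation that a later peak is not always lower than
its predecessor.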
Figure 4-7. Software Problems Encountered During Testing

Although principal test objectives varied from phase to phase, there
was considerable overlap and, admittedly, duplication. As will be pointed
out in Section 4.7, test cases in each phase were extended purposely to
broaden the scope of the error detection process. For example, the
principal objectives of validation testing were to verify that major
software capabilities documented in the approved design, including those
necessitated by software requirements, were present and operating
correctly. Yet, test case designers also addressed selected interface tests
and operational scenarios using operational-like data. Tests executed
during acceptance or requirements testing; system integration testing,
where a concerted effort was made to exercise interfaces and test anomalous
conditions to "break" the software; and operational demonstrations, where
tests followed an operational timeline, were all quite similar. Each phase
detected errors from every major error category. And more importantly, each
phase caught errors which should have been detected earlier. This expansion
of test objectives, coupled with the changes in personnel performing and
analyzing tests (thereby bringing in a "fresh" viewpoint), is believed to
have been very beneficial to the overall quality of the delivered product.

4.2.3  When  Were  Errors  Introduced? 

Although all errors being considered in this study were found in the
code during test or operational usage of the software, not all are
necessarily coding errors. Some may be traced to other sources. Four
development activities have been identified as sources of error.


• requirements  specification 

• design 

• coding 

• maintenance  (correction  of  other  errors) 

In trying to determine where each error was introduced, the
temptation was great to go down the detailed error category list and
judiciously assign a source, i.e., assign a probable source. For Project 3
this was done, since no attempt was made to determine error source as each
SPR was written. Results, summarized by major error category, are presented
in Table 4-7 for the 2019 errors which produced a change in the Project 3
code. Note that only design and coding sources are assigned. This is
because the list of four sources given above actually breaks down to
either design or coding sources for Projects 2 and 3. That is, no good
data exist on the number of errors introduced as a result of correcting
previously documented errors, and errors attributed to a source in
software requirements were probably recorded as design errors in this
investigation, although no problem report specifically cited requirements
in either project's data. This is partly due to the fact that requirements
were formally reviewed, approved, and baselined prior to the beginning of
design and held unchanging for the remainder of the development cycle.
However, the assumption should not be made that the requirements were
flawless, even in this type of controlled development environment where
requirements were formally specified and well understood by both customer
and development contractor. This is because, in the collection of

Table 4-7. Project 3 Error Sources


                                                 % OF TOTAL     PROBABLE SOURCES
MAJOR ERROR CATEGORIES                           CODE CHANGE    PERCENT   PERCENT
                                                 ERRORS         DESIGN    CODE

Computational (AA)
Logic (BB)
I/O (CC)
Data Handling (DD)
Operating System/System Support Software (EE)
Configuration (FF)
Routine/Routine Interface (GG)
Routine/System Software Interface (HH)
Tape Processing Interface (II)
User Interface (JJ)
Data Base Interface (KK)
User Requested Change (LL)
Preset Data Base (MM)
Global Variable/Compool Definition (NN)
Recurrent (PP)
Documentation (QQ)
Requirements Compliance (RR)
Unidentified (SS)
Operator (TT)
Questions (UU)

Averages

NOTES: (1) Although errors in these categories required changes to
           the code, their source breakdown of design versus code is
           not attempted here. Those considered in all other
           categories encompass 95 percent of all code change
           errors.

       (2) For Project 3, product enhancements or changes to the
           design baseline were considered "out-of-scope" and,
           therefore, are not present here.
supporting data to explain software error histories, poorly stated
requirements or changing interpretation of requirements were offered as
reasons for difficulty in developing several error-prone routines.

A quantitative breakdown between the design and coding sources can
be derived from Project 2 data by virtue of the fact that design
documentation for this project was very detailed, to the extent that flow
charts contained detail at the source code level. This detailed design
documentation went through a formal review and customer approval cycle
prior to coding of each Project 2 update. For the most part, the coding
task involved transferral, rather than translation, of the approved design
into source code. Since Project 2 is continually being updated, rigorous
controls are used to maintain the design documentation in a current
state. One of these is the Documentation Update Transmittal or DUT.

For Project 2 every software problem which necessitates a change to
the design documentation requires a DUT. Therefore, to get a feel for
the percentage of errors that are attributable to design, the Project 2
problem report history can be used if two assumptions are made.

1) The detailed design specification is the approved
design.

2) Any errors which require a change to the code and
necessitate a change to the detailed design
specification are to be considered design errors. Those
requiring only a code change are coding errors.
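
The two assumptions above amount to a simple classification rule that can
be applied to a problem report history. The Python sketch below is one way
to express it; the record layout, field names, function names, and sample
reports are hypothetical and are not the Project 2 problem report format.

```python
from dataclasses import dataclass


@dataclass
class ProblemReport:
    report_id: str
    code_changed: bool   # the fix required a change to the code
    dut_issued: bool     # the fix also required a design documentation update


def classify_source(report):
    """Apply assumption 2: code change plus design change means design error."""
    if not report.code_changed:
        return None  # not a code change error; outside this breakdown
    return "design" if report.dut_issued else "coding"


def source_percentages(reports):
    """Percent design versus coding errors among code change errors."""
    sources = [s for s in map(classify_source, reports) if s is not None]
    if not sources:
        raise ValueError("no code change errors to classify")
    design = 100.0 * sources.count("design") / len(sources)
    return {"design": round(design, 1), "coding": round(100.0 - design, 1)}


# Hypothetical reports, for illustration only.
sample_reports = [
    ProblemReport("R-1", code_changed=True, dut_issued=True),
    ProblemReport("R-2", code_changed=True, dut_issued=False),
    ProblemReport("R-3", code_changed=True, dut_issued=True),
    ProblemReport("R-4", code_changed=False, dut_issued=False),
]
breakdown = source_percentages(sample_reports)
```

The rule is only as good as its first assumption: where the approved
design is less detailed than the code, a design-level mistake may require
no documentation change and would be counted as a coding error.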

Data  from  seven  updates  to  the  Project  2 software  are  presented  in 
Table  4-8. 

The approaches taken above support the contention that errors in
one category might be traced back to a source in some previous
development phase, e.g., that not all errors in index computation are
coding errors. However, since analysis was done retrospectively, we've had
to work with what data are available, and accuracy has suffered. Project 5,
on the other hand, has given us an opportunity for data collection
requirements to increase both the scope of the error sources and the
accuracy of the source assignments. That is, by getting analysis help from
project performers, a further breakdown of error sources, including
requirements specification and maintenance, has been possible. Since it is
not always possible to determine where the error was introduced, a fifth
"not known" category was introduced to cover this situation.*
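
With the four sources listed earlier plus the "not known" category, source
assignments can be tallied as in the Python sketch below. The function
name and the sample assignments are hypothetical, and treating any missing
or unrecognized assignment as "not known" is an assumption made for this
sketch rather than a documented Project 5 procedure.

```python
from collections import Counter

ERROR_SOURCES = (
    "requirements specification",
    "design",
    "coding",
    "maintenance",
    "not known",
)


def tally_sources(assigned_sources):
    """Count errors per source; unrecognized values fall into "not known"."""
    tally = Counter({source: 0 for source in ERROR_SOURCES})
    for source in assigned_sources:
        key = source if source in ERROR_SOURCES else "not known"
        tally[key] += 1
    return dict(tally)


# Hypothetical assignments, for illustration only.
example_tally = tally_sources(["design", "coding", None, "design"])
```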


Project 5 data provides a unique opportunity to examine the variation
of error source as a result of the top down, multiple increment approach to
development.  It is also a unique opportunity from the standpoint of
software requirements, since unlike Projects 2 and 3, the Project 5 software
requirements were not as well understood at project outset and have been
the subject of continual controlled restatement and refinement.  Project 5
can be considered a state of the art real-time software system with highly
complex and detailed requirements.**  In this respect, it is typical of


*A fact of life in real world data collection.

**In sheer numbers alone Project 5 has 1165 software requirements as
opposed to 188 for Project 3.  Both projects employ baseline management
techniques with requirements reviews and maintenance of traceability.
Table 4-8.  Project 2 Error Sources

               NO. OF SOURCE                    PERCENT   PERCENT
               STATEMENTS IN    TOTAL ERRORS    DESIGN    CODING
MODIFICATION   MODIFICATION     ENCOUNTERED     ERRORS    ERRORS

MOD1A              12S3             152          73.6      26.4
MOD1B              9880             156          73.7      26.3
MOD1BR*             779              73          35.6      64.4
MOD2               9631             419          51.6      48.4
MOD3               457S             199          58.8      41.2
MOD3.1                              113          61.9      38.1
MOD3.2              **              120          65.8      34.2

*MOD1BR was a small "retrofit" update to Project 2, which may have
had something to do with the low design error percentage.

**Size data not accurate.

the  current  trend  for  rigorous  requirements  specification  becoming  more 
common  in  the  industry. 

The data examined were from 689 Project 5 code change problem reports
written during integration testing.  Four portions of the Project 5 code
are represented in the following tables:

• real-time applications software (Table 4-9)

• real-time simulator software (Table 4-10)

• batch mode Product Assurance Tools (Table 4-11)

• real-time operating system software (Table 4-11)*


*These values are presented only as a sample of operating system
software data.  Data must not be considered complete because Project 5 is
not complete and development schedules have not provided a good
breaking point for analysis.




Table 4-9.  Project 5 Error Sources by Increment
            (Applications Program Software)

                  PERCENT CODE OR DATA BASE CHANGE ERRORS
                  FOR EACH INCREMENT (TOP DOWN)

                  INCREMENT  INCREMENT  INCREMENT  INCREMENT
                      0          1          2          3       TOTAL

Requirements
Design
Code
Maintenance*
Not Known

Code or Data Base
Change Errors        23         85         78         89        275

Total Errors
Reported**           44        114        101        118        377

*This refers to errors introduced as a result of fixing previously
documented errors.

**These include non-code or data base change errors such as operator
errors, documentation errors, and "no problem" reports.


Table 4-10.  Project 5 Error Sources by Increment
             (Simulator Software)

                  PERCENT CODE OR DATA BASE CHANGE ERRORS
                  FOR EACH INCREMENT (TOP DOWN ITERATION)

                  INCREMENT  INCREMENT  INCREMENT  INCREMENT
ERROR SOURCE          0          1          2          3       TOTAL

Requirements
Design
Code
Maintenance*
Not Known

Code or Data Base
Change Errors        20         29         87         81        225

Total Errors
Reported**           29         32         98        128        287

*This refers to errors introduced as a result of fixing previously
documented errors.

**These include non-code or data base change errors such as operator
errors, documentation errors, and "no problem" reports.

Results for the various portions of the Project 5 system are not
consistent, as may be seen in the following tables.  Several trends do appear
in the data, however.

1)  For each of the software packages being developed
in a top-down, multiple increment approach (Tables 4-9
and 4-10) the percentage of errors attributable to
requirements is increasing, although the total number
of errors found per increment is remaining fairly
constant.  A closer look at these errors shows that
they fall principally in the logical, computational,
interface, and data structure categories.  However,
the text of these problem reports does not necessarily
point to an error.  Project 5 is so complex that in
some areas, such as performance (timing and accuracy)


Table 4-11.  Project 5 Error Sources (Batch Mode PA Tools and
             Real-Time Operating System Software)

                  PERCENT CODE OR DATA BASE CHANGE ERRORS

                  BATCH MODE      REAL-TIME OPERATING
                  PA TOOLS        SYSTEM SOFTWARE*

Requirements         0.9                 7.4
Design
Code
Maintenance**
Not Known

Code or Data Base
Change Errors        108                  81

Total Errors
Reported             131                  99

*Values in this column cannot be considered final.

**This refers to errors introduced as a result of fixing previously
documented errors.


requirements, it is necessary to build a portion of
the system and play it off against a real-time
simulator to verify that requirements are valid.  The
percentage of these "errors" goes up because the
number of requirements implemented increases with
each successive increment.  The multiple increment
approach has allowed continual validation of
requirements along with validation of the software.

2)  The breakdown of design and coding errors is not the
same as that seen for Projects 2 and 3, partly due to
the larger number of categories being evaluated and
partly due to the nature of Project 5.  Note in
Table 4-9 that the percentage of design errors has
gone down to 10.1 percent in Increment 3 (from a high
of 95.7 percent in the predominantly stub structure of
Increment 0).  This may be a desirable feature of the
multiple increment approach since coding errors have
been shown by Shooman and Bolsky [4] to be less
costly to diagnose and correct than design errors.*
Their results showed that design errors required an
average of 3.1 manhours to diagnose and 4.0 manhours
to correct.  Coding errors required an average of
2.2 manhours to diagnose and 0.8 manhours to
correct.  The decline in design errors is not surprising
since early increments effectively establish the major
design features.

3)  Maintenance  errors,  i.e.,  those  errors  resulting 
from  the  correction  of  previously  documented  errors, 
reached  a maximum  of  9.0  percent  of  all  code  change 
errors in one case.  A practical norm for this
error source, however, is probably in the range of
2 to 5 percent.
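
The Shooman and Bolsky averages quoted in item 2 give 7.1 manhours per
design error (3.1 to diagnose plus 4.0 to correct) against 3.0 manhours
per coding error (2.2 plus 0.8). The sketch below blends the two by the
fraction of design errors in a population. The blending is an
illustration only and is not taken from the study: it assumes the quoted
averages apply uniformly to every error, and the function name is
hypothetical.

```python
# Average manhours per error as quoted from Shooman and Bolsky [4].
DIAGNOSE_HOURS = {"design": 3.1, "coding": 2.2}
CORRECT_HOURS = {"design": 4.0, "coding": 0.8}


def mean_hours_per_error(design_fraction: float) -> float:
    """Expected diagnose-plus-correct manhours per error for a population
    in which `design_fraction` of the errors are design errors and the
    remainder are coding errors (illustrative assumption)."""
    if not 0.0 <= design_fraction <= 1.0:
        raise ValueError("design_fraction must lie between 0 and 1")
    design_cost = DIAGNOSE_HOURS["design"] + CORRECT_HOURS["design"]
    coding_cost = DIAGNOSE_HOURS["coding"] + CORRECT_HOURS["coding"]
    return design_fraction * design_cost + (1.0 - design_fraction) * coding_cost
```

Under these assumptions every shift of one error from the design column
to the coding column saves 4.1 manhours on average, which is the sense in
which a falling design error percentage is desirable.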


*Unfortunately, we were unable to collect this type of information during
the S.R.S.
4.2.4  Preoperational vs Operational Problems

One of the popular beliefs held by test personnel in the Project 3
environment is that the routines which are error prone during preoperational
testing are also the routines which are error prone once the software is
operational.  Project 3 operational data collected during this study made
it possible to investigate the accuracy of this belief.

Figure 4-9 plots operational problems versus preoperational
problems for the 249 Project 3 routines.  From this we see that the
correlation isn't very strong at the routine level, r = 0.406.*  Figure 4-10,
which presents the same data at the function level, shows a stronger
correlation with r = 0.920.  Results at the routine level probably would have
been much improved had an assessment of problem criticality been available
for use in the investigation.
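
The routine-level and function-level correlations quoted above can be
reproduced in outline with a plain Pearson coefficient. The following is
a minimal sketch, not the study's actual procedure: the function names,
the routine-to-function mapping, and the sample counts are all
hypothetical, and it assumes that function-level figures are obtained by
summing the problem counts of the routines in each function.

```python
import math
from collections import defaultdict


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant data")
    return sxy / math.sqrt(sxx * syy)


def sum_by_function(counts, function_of_routine):
    """Roll per-routine counts up into per-function totals (assumed roll-up)."""
    totals = defaultdict(int)
    for routine, count in counts.items():
        totals[function_of_routine[routine]] += count
    return dict(totals)


# Hypothetical data: problem counts per routine and a routine -> function map.
preop = {"R1": 4, "R2": 0, "R3": 7, "R4": 2}
oper = {"R1": 1, "R2": 2, "R3": 3, "R4": 0}
function_of = {"R1": "F1", "R2": "F1", "R3": "F2", "R4": "F2"}

routines = sorted(preop)
r_routine = pearson_r([preop[k] for k in routines], [oper[k] for k in routines])
preop_by_function = sum_by_function(preop, function_of)
```

Squaring the reported coefficients gives the fraction of variance
accounted for: r = 0.406 corresponds to about 16 percent at the routine
level, and r = 0.920 to about 85 percent at the function level.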

One interesting fact was revealed in the course of trying to identify
reasons why routines were statistical outliers, i.e., falling outside the
90 percent confidence limits in Figure 4-9.  Turning to the routine
difficulty information collected early in the S.R.S., an attempt was made to
determine why routines were outliers above the regression line,** i.e., had
a higher number of operational problems compared to other routines.  First,
the average "difficulty to develop" of the nine high outliers is 12.1 while
the average for all of TRW's 174 routines was 9.2, where the difficulty
range is from a minimum of 5 to a maximum of 15.  A summary of difficulty
data for the nine outliers is given in Table 4-12 below.

Table  4-12.  Difficulty  Ratings  for  Operational  Outliers 


Difficulty 


Implement  Checkout  Document  Totals 


*Deleting routines with zero problems didn't improve the correlation.

**Only one routine was an outlier below the regression line.  This
was judged to be a difficult routine to develop (difficulty = 14)
but "thoroughly" tested.  The chief reason given was that it was
an easily modularized, primarily computational routine.  See also
Section 4.3.5.8.
Figure 4-9.  Project 3 Operational Problems versus Preoperational Problems
at Routine Level

Figure 4-10.  Project 3 Function Level Operational Problems
versus Preoperational Problems


Most interesting were the reasons given for the generally high
difficulty ratings.  Complex logic, core loading problems, and data
interfaces were given as reasons, but these reasons were also applied to routines
that were not outliers.  A fourth reason was given for three of the nine
outlier routines which was not given for any other routines, and that reason was
"changing requirements."  However, statements of requirements were not changed
after the software requirement specification was approved and baselined, so
this points to changes in interpretation of poorly stated requirements.  In
fact, requirements for one of these outlier routines continued to be a point
of contention well into the operational phase of the Project 3 life cycle.
4.2.5  Software Problems as a Function of Routine Size*

One of our beliefs at the beginning of the Software Reliability Study
was that the number of problems encountered in a routine would correlate
well with the routine's size measured in total source statements.  This
was a finding in an earlier study done by TRW for the CCIP-85 Study
Group [5], which stated that "there is a strong correlation between the
number of problems encountered during testing and the size/complexity of
the programs being tested."  The subject project in that study was the
initial version of Project 2.

An attempt to plot actual problems for all Project 3 routines,
Figure 4-11, generally indicated that larger routines did experience more
problems, bearing out the earlier findings.  However, in an attempt to
discover if certain types of routines were more error prone than others,
routines were grouped functionally into subsystems according to their
position in the Project 3 software architecture, and correlation improved
considerably.  Figures 4-12 through 4-21 present results of this
investigation.  Figure 4-22 presents a summary of linear regression lines translated
through the origin for the sake of comparison.  Results may be summarized
by the following:

1)  The  correlation  coefficient,  r,  for  the  linear 
best  fit  ranged  from  a low  of  0.5408  to  a maxi- 
mum of  0.9449. 

2)  The linear best fit for all subsystems fell
roughly between 10 and 20 problems per 1000 total
JOVIAL source statements,*** with the exception of
two routine groupings.
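
The "problems per 1000 statements" figures above are slopes of ordinary
least-squares lines fitted to (routine size, problem count) pairs. The
following is a minimal sketch under stated assumptions: the function
names and sample data are hypothetical, and "translated through the
origin" is taken to mean keeping the fitted slope while dropping the
intercept so that lines for different subsystems can be compared.

```python
def fit_line(sizes, problems):
    """Ordinary least-squares fit of problems = slope * size + intercept."""
    n = len(sizes)
    if n != len(problems) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(sizes) / n
    mean_y = sum(problems) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    if sxx == 0:
        raise ValueError("all routine sizes are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, problems))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def problems_per_thousand(slope):
    """Express a slope (problems per statement) per 1000 source statements."""
    return 1000.0 * slope


# Hypothetical subsystem: routine sizes in total source statements.
sizes = [100, 200, 300, 400]
problems = [3, 4, 6, 7]
slope, intercept = fit_line(sizes, problems)
rate = problems_per_thousand(slope)  # slope of the line translated to the origin
```

For this made-up sample the rate falls inside the 10 to 20 problems per
1000 statements band reported for most Project 3 subsystems.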

The two routine groupings which did not fall in the 10 to 20 problems
per 1000 statements range were Subsystem D, Figure 4-17, and Subsystem F and
Function G2, Figure 4-19, both falling below 10 problems per 1000 statements.
The reason for this is most likely thoroughness of development testing.****
Subsystem D, although it contained the largest routine in the Project 3
system (in excess of 2300 total statements), was highly proceduralized
(modularized) computational code that was testable in small segments.  Also,
since this subsystem executed at the end of the serial processing order,
system level tests of Subsystem D were not possible until predecessor tests


*Section 4.3.3.2 et seq. presents a generalized statistical regression
analysis which also considers some of the data presented in this Section;
in particular see Section 4.3.4.1 and Tables 4-16 and 4-17 in regard to
parameter  Z,. 

**Groupings by routine purpose (e.g., computational, input, output, etc.)
are addressed in Section 4.3.5.1.

***These include executable and non-executable statements, but not comments.

****That testing done by the developers prior to delivery of the software to
the independent test team.
were successfully completed, allowing testing at the routine level to be
completed.  Subsystem F and Function G2 were functionally similar because
they contained utility routines.  These routines also received more
thorough testing because their relatively smaller size made them easier to
test, dependencies of other routines on these utilities forced testing to
be completed earlier, and additional testing was performed by users of
these utilities.

Explanation of error rates experienced in the other Project 3
subsystems is not so easy, although such things as complexity and difficulty,
as well as thoroughness of testing,* are believed to be factors.
Examination of the relationship of software problems to other attributes is
presented in Section 4.3.

A look at problems documented during Project 3 operations brings out
an interesting fact related to size, however.  By calculating an average
of errors encountered for routines in size groupings it was noted that
the  large  routines,  say  greater  than  I\300  total  source  statements,  were 
the routines which tended to experience more errors** after preoperational
testing, Figure 4-23.  Preoperational data showed a linear relationship
of errors to routine size, i.e., a large routine was no more error prone
than a small one.  Although data are limited and further investigation
with other sources of data is recommended, Project 3 results suggest that
the best fit may not be a straight line when operational errors are
included, Figure 4-24.  In this figure light shaded bars are
preoperational data and dark shaded bars are operational data.  This would
support the contention that it is the small routines which are easiest to
test thoroughly in preoperational testing and the larger routines that
go into the operational environment containing residual, undetected
errors.

Investigations of size on the other source projects were not fruitful,
although Project 2 showed similar results, but not so well correlated, as
those for Project 3.  Project 4, which is written in a macro language and
has only operational error data, showed no correlation at all between
errors and size.

Project 5 has a limit on routine size, 100 executable statements, so
investigations of errors and size were performed by combining routines to
form larger modules called tasks.  Even so, there was very poor
correlation.  Correlations may improve after completion of all Project 5 testing.


*We have no quantitative measure of test thoroughness for Project 3.

**To a user it is often a simple count of errors per routine, rather than
any other measure, that shapes his opinion of reliability.
Actual Problems vs. Routine Size
(Project 3, Subsystem G, Function

Figure 4-22.  Actual Problems vs. Routine Size, Regression Lines
Translated Through Origin

Figure 4-24.  Preoperational and Operational Errors per Routine
Grouped by Size (Project 3)
4.3  The Phenomenological Approach to Software Reliability

Some of the concepts discussed in this section were first proposed
to be applied to software by Rubey and Hartwick [6].  The authors of [6]
may have chosen the word "Quality" in part because of the empiricism of
the concepts, but clearly the intent is to measure and/or deal with
Reliability, as defined earlier.

The Reference [6] method was to define attributes and their metrics,
the former being a prose expression of the particular quality desired of
the software, the latter a mathematical function of parameters thought to
relate to or define the attribute.  The major attributes, such as, for
example: "A1 - Mathematical calculations are correctly performed"; or:
"A5 - The program is intelligible"; or: "A6 - The program is easy to
modify," were each further categorized in [6] to describe less abstract,
i.e., more concrete, attributes capable of being measured as to whether
the attribute is present to some degree (on a scale of 0 to 100).  Only a
few metrics were defined in [6], although a detailed breakdown of each
major attribute was given; and no particular application was mentioned in
the reference.

TRW Systems reported in [7] on a study which included the
formulation of metrics and their application in a controlled experiment to two
computer programs independently prepared to the same specification.  In the
study only a limited number of attributes were considered, primarily those
corresponding to attributes A5 and A6 of [6], mentioned previously.  It is
considered that the application in [7] was successful as far as it went,
in that the number of problems encountered, as well as the reliabilities
of each program (determined by the methods described in Section 5.2.4),
appeared to correlate well with the metrics' values.

In  [1],  a more  basic  approach  to  software  reliability  in  terms  of 
attributes,  or  characteristics,  and  their  metrics  was  attempted.  This 
study focused primarily upon that which could be termed good (or bad)
practices and standards applying to FORTRAN coding.  The use of the
good practice or adherence to the specified standard was defined as the
value of the metric (i.e., 1 or 0), although in many cases a gradation
of values between 0 and 1 could be defined.  The study related the type
of  practice  or  standard  to  more  abstract  characteristics  such  as  "under- 
standability,"  "testability,"  "maintainability,"  etc.  Reference  [1] 
expresses  in  great  detail  the  rationale  for  and  relationships  between  the 
primitive  characteristics,  more  concrete  characteristics,  and  the  metrics, 
together  with  a subjective  analysis  of  their  suitability  or  correlation 
with  quality,  i.e.,  high  reliability  software. 

4.3.1  Current  Approach 

Early  in  the  current  study  it  was  decided  to  make  the  phenomenological 
approach  to  software  reliability  via  a route  more  closely  related  to  that 
taken in Reference [7] rather than [1].  The primary reasons for this
choice  were  the  availability  of  'TMETRIC,  a software  tool  developed  to 
analyze  characteristics  of  programs  written  in  JOVIAL  J4  language,  and  the 
unavailability  of  both  tools  and  manpower  to  accomplish  the  analysis  set 
forth  in  [1]  on  Projects  2 or  3 programs. 


Consequently  the  approach  for  this  study  has  been  to  use  'TMETRIC  to 
collect  data  on  characteristics  connoting  complexity,  and  by  implication 
error-proneness  and/or  difficulty  of  error  detection  and  correction.  The 
capabilities  of  'TMETRIC  are  described  in  Section  2 of  this  report. 

4.3.2  The Relationship of Reliability With 'TMETRIC Information


The basic assumption is that certain measurable characteristics can
be selected to provide a sufficiently accurate and precise indicator of
reliability.  To utilize the numbers of problems encountered during the
test program as an indicator of reliability, it must first be recognized
that it would be the reliability as of the beginning of the test program,
since during the course of testing the detected errors are corrected (with
high probability).  However, it would also have to be assumed that new
errors introduced by correcting a problem, if detected during the course
of the test program, are recognized as newly created.  These would not be
counted in the total number of problems measuring reliability.  Furthermore,
it needs to be assumed that the same number of originally present errors
would occur whether or not other originally present errors are detected and
corrected.  A final major assumption which would support the hypothesis
that numbers of problems are an indicator of reliability is that the test
program uses a set of test cases representative of the operational profile.

Note  that  the  word  "indicator"  was  used  instead  of  "estimator."  By 
this  is  meant  that  numbers  of  problems  under  the  above  assumptions  would 
be  proportional  to  unreliability,  so  that  if  a similar  test  program  followed 
the  first  test  program,  in  which  number  of  errors  occurring  were  less  by  a 
factor  of  (say)  two,  then  it  could  be  inferred  that  the  unreliability  was 
also  reduced  by  a factor  of  two. 

Hence, with no further assumptions, such as those of the Shooman or
Jelinski-Moranda models (Section 5.2) needed to predict future reliability,
the numbers of problems occurring during testing would only reflect the
previous reliability of the software.  Furthermore, as observed in [7], the
complexity parameters of the software (as collected and analyzed by 'TMETRIC)
will remain essentially constant even though coding changes are made to
correct errors.  This means that if the values of the complexity parameters
have any meaning, they also can only reflect the reliability of the software
prior to the test program, i.e., that of the basic design.

As  a consequence  of  the  preceding  assumptions,  two  basic  models, 
similar  in  several  respects,  were  considered  for  predicting  numbers  of 
software  problems  as  a function  of  'TMETRIC  data.  The  first  model  represents 
a preliminary  attempt  to  construct  a set  of  metrics  which  both  separately 
and  together  measure  the  attribute  "complexity".  The  model  was  tested  using 
standard  linear  regression  analysis  in  order  to  determine,  primarily  by  the 
value  of  the  multiple  correlation  coefficients,  how  well  the  chosen  com- 
plexity metrics  explained  or  predicted  the  number  of  software  problems.  The 
results of this analysis, given in Section 4.3.3.1, showed that fair-to-good
correlations were obtained.
On the other hand, in constructing a complexity model to predict
numbers of software problems, subjective judgements were made as to the
relative influence of each of the (sub)metrics (obtained from 'TMETRIC)
comprising the several complexity metrics by directly assigning numerical
"weights" or influence coefficients.  Secondly, the functional forms chosen
for the set of metrics were considered somewhat complex for their purpose.

The second model, which was the main subject of investigation in this
study, was formulated based upon the philosophy that it should be simple
in form, and that the relative influence of the various (sub)metrics should
be estimated from the data.  As a consequence the primary model chosen to
relate software problems to 'TMETRIC data was the well-known linear
regression model

    N_p = \sum_i a_i f(T_i)

where N_p is the number of actual problems* for the software module under
analysis, and f(T_i) is some specified function of the ith 'TMETRIC
parameter value T_i.  For every T_i, f(T_i) is defined to be nondecreasing in T_i
so that increasing the value of T_i should result in greater or at least
equal complexity.  Consequently, it is worthwhile to constrain the
coefficients a_i to be nonnegative; that is, number of problems should either
increase, or at least not decrease, with increasing value of any one
complexity parameter.  One of the objectives will then be to determine if one
or more of the coefficients are small with respect to the remaining
coefficients.  These would then be assigned the value zero, reflecting the
information that the corresponding parameters are not influential in predicting
N_p.  The main objective is to find a "best-fitting" set of coefficients a_i
(some of which will be zero) and then evaluate the statistical consequences
in terms of confidence limits on N_p.  The choice of functions f (in most
cases f(x) = x) and the manner in which this general constrained regression
problem is solved and applied to the Project 3 data is the major topic of
Section 4.3.3.
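
A least-squares fit with nonnegative coefficients, as described above,
can be sketched with cyclic coordinate descent. This is one common way to
solve such a problem and is offered only as an illustration; it is not
claimed to be the solution method of Section 4.3.3. The function name,
the iteration count, and the sample data are hypothetical, and f(x) = x
is assumed for every parameter.

```python
def nonnegative_least_squares(X, y, sweeps=500):
    """Minimize sum_k (y[k] - sum_i a[i] * X[k][i])**2 subject to a[i] >= 0.

    X is a list of rows, one per software module; column i holds f(T_i).
    Each pass updates one coefficient at a time to its best nonnegative
    value with the other coefficients held fixed.
    """
    n_obs, n_par = len(X), len(X[0])
    a = [0.0] * n_par
    resid = [float(v) for v in y]  # residual y - X a, with a = 0 initially
    col_sq = [sum(X[k][i] ** 2 for k in range(n_obs)) for i in range(n_par)]
    for _ in range(sweeps):
        for i in range(n_par):
            if col_sq[i] == 0.0:
                continue  # an all-zero column carries no information
            grad = sum(X[k][i] * resid[k] for k in range(n_obs))
            new_ai = max(0.0, a[i] + grad / col_sq[i])  # clip at zero
            delta = new_ai - a[i]
            if delta != 0.0:
                for k in range(n_obs):
                    resid[k] -= delta * X[k][i]
                a[i] = new_ai
    return a


# Hypothetical data in which an unconstrained fit would give the second
# parameter a negative coefficient; the constraint pins it at zero.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [3.0, 2.0, 1.0]
coefficients = nonnegative_least_squares(X, y)
```

A coefficient driven to zero in this way corresponds to a parameter
judged not influential in predicting N_p.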

4.3.3  Analysis  of  Software  Problem  Data  Models 
4.3.3.1  Complexity Model

Initially  it  was  hypothesized  that  numbers  of  software  problems  could 
be  explained  or  predicted  from  certain  measures  of  complexity.  In  order  to 
investigate  this  hypothesis  several  factors  which  are  believed  to  contribute 
to  complexity  were  considered  and  metrics  were  defined  in  order  to  provide 
a numerical  measure  of  these  complexity  factors. 

These  categories  of  complexity  were  considered: 

1.  Logic complexity - related to source-code statements of logical
relationships, primarily branching decisions, but including loop
and IF-nesting level measures.

*"Actual" problems are problems (errors) which required a change to the
code or data base to effect corrective action.

2.  Interface complexity - measured by number of application program
interfaces (number of other routines called), and number of
system interfaces (number of system routines called).

3.  Computational complexity - measured by assignment statements
containing arithmetic operators.

4.  Input/output complexity - measured by number of I/O statements.

5.  Readability - measured by number of comment statements.

Definitions  of  Complexity  Metrics 

1.  The Logic Complexity metric, referred to as Total Logic Complexity,
L_TOT, can be numerically evaluated for each routine by calculating:

    L_TOT = LS/EX + L_LOOP + L_IF + L_BR

where

    LS     = number of logic statements

    EX     = number of executable statements

    L_LOOP = a measure of loop complexity defined in Table 4-14

    L_IF   = a measure of IF-condition statement complexity,
             also defined in Table 4-14

    L_BR   = number of branches BR* times 0.001 (the factor 0.001
             was chosen to assign what was believed to be the
             relative importance of number of branches within the
             expression for total logic complexity).

2.  The Interface Complexity metric C_INF is defined as follows:

    C_INF = AP + 0.5(SYS)

where

    AP  = number of application program interfaces

    SYS = number of system program interfaces

    0.5 = a factor chosen to assign what was believed to be the
          relative importance of system program interfaces to
          application program interfaces

*The number of branches parameter as used in the complexity model of this
section was redefined in subsequent analyses to include all branch-producing
statements.
3.  The Computational Complexity metric CC is defined as follows:

    CC = (CS/EX) \times (L_SYS / \sum CS) \times CS

where

    CS      = number of computational statements

    L_SYS   = \sum L_TOT, the sum over all routines of L_TOT, the Total
              Logic Complexity for each routine (defined previously)

    \sum CS = the sum over all routines of the values of CS for
              each routine

4.  The Input/Output Complexity metric is defined for each routine
as follows:

    C_I/O = (S_I/O / EX) \times (L_SYS / \sum S_I/O) \times S_I/O

where

    S_I/O      = number of Input/Output statements

    \sum S_I/O = the sum over all routines of the values of S_I/O for
                 each routine

5.  Readability, an uncomplexity metric, is defined for each routine
as follows:

    U_READ = COM / (TS + COM)

where

    TS  = total number of statements (executable plus
          non-executable, exclusive of comment statements)

    COM = number of comment statements

The Total Complexity Metric, C_TOT, is defined for each routine by

    C_TOT = L_TOT + 0.1 C_INF + 0.2 CC + 0.4 C_I/O + (-0.1) U_READ

The factors 0.1, 0.2, 0.4, and -0.1 applied to Interface Complexity
(C_INF), Computational Complexity (CC), Input/Output Complexity (C_I/O) and
Readability (U_READ), respectively, represent what are believed to be the
relative weights of these metrics in an additive model for total complexity.
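
The metric definitions above can be collected into a short calculation.
The following is a sketch under stated assumptions rather than the
'TMETRIC implementation: the class and function names are hypothetical,
L_LOOP and L_IF are taken as given inputs because Table 4-14 is not
reproduced here, and every routine is assumed to have at least one
executable statement.

```python
from dataclasses import dataclass


@dataclass
class Routine:
    LS: int        # logic statements
    EX: int        # executable statements (assumed > 0)
    L_LOOP: float  # loop complexity measure (Table 4-14, taken as input)
    L_IF: float    # IF-condition complexity measure (Table 4-14, taken as input)
    BR: int        # branches
    AP: int        # application program interfaces
    SYS: int       # system program interfaces
    CS: int        # computational statements
    S_IO: int      # input/output statements
    TS: int        # total statements, exclusive of comments
    COM: int       # comment statements


def l_tot(r: Routine) -> float:
    """Total Logic Complexity: LS/EX + L_LOOP + L_IF + 0.001 * BR."""
    return r.LS / r.EX + r.L_LOOP + r.L_IF + 0.001 * r.BR


def c_inf(r: Routine) -> float:
    """Interface Complexity: AP + 0.5 * SYS."""
    return r.AP + 0.5 * r.SYS


def total_complexity(routines):
    """Return C_TOT for each routine; the sums run over all routines."""
    l_sys = sum(l_tot(r) for r in routines)
    sum_cs = sum(r.CS for r in routines)
    sum_io = sum(r.S_IO for r in routines)
    results = []
    for r in routines:
        cc = (r.CS / r.EX) * (l_sys / sum_cs) * r.CS if sum_cs else 0.0
        c_io = (r.S_IO / r.EX) * (l_sys / sum_io) * r.S_IO if sum_io else 0.0
        u_read = r.COM / (r.TS + r.COM) if (r.TS + r.COM) else 0.0
        results.append(l_tot(r) + 0.1 * c_inf(r) + 0.2 * cc + 0.4 * c_io - 0.1 * u_read)
    return results


# Hypothetical single-routine system.
example = Routine(LS=10, EX=100, L_LOOP=0.5, L_IF=0.25, BR=20,
                  AP=3, SYS=2, CS=40, S_IO=5, TS=150, COM=50)
c_tot_values = total_complexity([example])
```

Note that CC and C_I/O depend on system-wide sums, so a routine's total
complexity changes when routines are added to or removed from the set
being analyzed.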


Although five complexity metrics were defined as predictors for
numbers of problems, it was decided at first to place emphasis upon the
analysis of Total Logic Complexity (L_TOT) as a single predictor, and to
group the remaining four metrics into a single complexity metric, labeled
Total-Logic.  Subsequently, in the more general regression analysis
described in Section 4.3.3.2 et seq., the parameters making up the remaining
metrics were considered individually as predictors although not directly
in the above form.

Figure 4-25 presents selected results of regression analyses of number
of actual problems versus Total Complexity Metric C_TOT for Subsystems A,
B, C, D, E, and Function G1.  The data for Subsystems F and H were plotted
and found to be extremely "noisy," as had been expected to a certain extent,
since Subsystem F consists of utility routines and Subsystem H of data
management routines with a varied history, some routines being well
established and unmodified, others modified for Project 3, and the remainder
newly developed.

Consequently, plots of number of problems versus $C_{TOT}$ for Subsystems F and H are not presented. Routines exhibiting anomalous values of number of problems were removed from the data before computing the regression line, as indicated on the plots. Values of $r$ (actually $r^2$) indicate the fraction of the variance of numbers of problems accounted for by the particular regression function.

Table 4-13 compares correlation coefficients $r$ of number of problems with each of the metrics $C_{TOT}$, $L_{TOT}$ and $C_{TOT} - L_{TOT}$ for Subsystems A, B, C, D, E and Function $G_1$. The values of $r$ range from "fair" to "good," in that $100r^2$ (%) usually exceeds 50%, and for Function $G_1$ exceeds 80%.

4.3.3.2 Generalized Regression Model

A "nonclassical" method for predicting numbers of software problems as a function of measurable parameters of the source code was applied to Project 3 data. The basic technique is to determine the coefficients of a linear function of the defined parameters by minimizing the sum of squares of deviations of the observed number of software problems from the assumed linear function; i.e., by "least squares," or "linear regression." However, the distinguishing feature of the method applied for this study is that those coefficients corresponding to parameters adjudged to have a positive (increasing) effect on number of software problems were constrained to be nonnegative. Conversely, the coefficients of those parameters which would cause a decrease in number of software problems were constrained to be nonpositive. These added judgement factors applied to the least-squares (or linear regression) problem convert it from a relatively simple technique to a nonlinear programming problem with a quadratic objective function (function to be minimized) and linear inequality constraints.

Table 4-13. Correlation Coefficients for Number of Software Problems versus Complexity Metrics

Statistical analysis of the estimators of the coefficients for nonlinear programming problems (e.g., evaluation of correlations, and also confidence limits on the coefficients) appears to be essentially untouched in the literature, even in the relatively simple cases of a quadratic objective function (as opposed to some arbitrary convex nonlinear function) and the simple nonnegative (or nonpositive) linear inequality constraints*. Nevertheless, the estimated variance of the observed number of problems about the best-fitting linear function of a given set of parameters supplies a good measure, although not strictly valid, for judging the goodness of fit relative to alternate choices of parameter sets, and is used in this analysis for this purpose.

Following the constrained least squares analysis, subsequent linear regression analyses without the previously described inequality constraints were applied considering only certain subsets of the originally selected parameters. Since the parameters not considered were eliminated because their influence coefficients were estimated as zeroes in the constrained least squares analysis, it would not be strictly valid to make the usual inferences obtained from standard regression analysis methods. However, if the reduced sets of parameters were the only ones to then be considered in future predictions, then subsequent inferences based upon the standard regression analysis could claim more validity. On the other hand, should it be found in future unconstrained regression analyses that one or more of the important influence coefficients are negative when it had been "firmly established" that the corresponding parameter(s) had a positive effect on producing software problems, then the analysis would need to be re-started at some point, or else the theory improved.

It is also possible that with an improvement in the theoretical development of a model for prediction of software problems, the regression might be, more appropriately, nonlinear. In that circumstance, the tools described by Reference [8] could be used, in general.

Reference [9] discusses the constrained least-squares technique and provides a set of FORTRAN programs** in its Appendix C for solving the problem described in this report. Since we did not happen to have these programs, a set of programs already available to us, which handle general nonlinear programming problems based on Reference [8], were used for the analysis.


*Private communication by Professor A. Madansky, University of Chicago.

**To purchase code and data in machine-readable form, inquiries should be directed to International Mathematical and Statistical Libraries, Inc., Suite 510, 6200 Hillcroft, Houston, Texas 77036.
The basic data element consists of a single (sub)routine, identified by a label indicating the portion of a given function it performs, and the subsystem grouping of the required functions of the software. The data obtained for each routine consist of the total number of actual problems together with measured values of sixteen parameters. Table 4-14 defines the parameters used in the analysis, and Table 4-15 presents the parameter values for each routine, primarily obtained by use of TMETRIC, a JOVIAL source code static tool. In Table 4-15, two of the parameters, $L_{LOOP}$ and $L_{IF}$, were developed from the TMETRIC data by accounting for nesting level of the Loop and IF statements used.* Also two parameters, RAT and WK-LD (programmer rating and workload respectively), were developed from data obtained from interviews with management personnel connected with Project 3. Data for these two parameters were unobtainable for subsystems G and H, however. All parameters except COM (number of comments) and RAT were considered as having a positive effect on number of software problems. For simplicity in the computation,** the latter two parameters were given minus signs, and in this way all coefficients could be constrained to be nonnegative.
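The sign convention described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the study's software: the helper names `flip_signs` and `unflip` and the sample numbers are hypothetical.

```python
# Parameters adjudged to reduce the number of problems (per the text: COM and RAT).
NEGATIVE_EFFECT = {"COM", "RAT"}


def flip_signs(names, rows):
    """Negate the COM and RAT columns so every coefficient may be constrained >= 0."""
    signs = [-1.0 if name in NEGATIVE_EFFECT else 1.0 for name in names]
    flipped = [[s * v for s, v in zip(signs, row)] for row in rows]
    return signs, flipped


def unflip(signs, coeffs):
    """Map nonnegative coefficients fitted on flipped data back to the original signs."""
    return [s * a for s, a in zip(signs, coeffs)]


names = ["BR", "COM"]
rows = [[3.0, 10.0], [5.0, 20.0]]          # hypothetical parameter values
signs, flipped = flip_signs(names, rows)
original_coeffs = unflip(signs, [0.5, 0.02])
```

A nonnegative coefficient on the negated column is the same model as a nonpositive coefficient on the original column, so predictions are unchanged by the transformation.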

4.3.3.3 Analysis of the Generalized Regression Model

Assume there are N parameters $Z_j$, j = 1, 2, ..., N, and K routines in a specified grouping. $\hat{Y}_i$ is the (theoretically) expected number of software problems in the ith routine, i = 1, 2, ..., K.

The expected number of software problems is assumed to be expressible as a linear function of the parameters:

$$\hat{Y}_i = \sum_{j=1}^{N} a_j Z_{ji} \qquad (1)^{***}$$


*Calculated values of $L_{LOOP}$ and $L_{IF}$ are shown multiplied by 1000, in order to avoid convergence problems in the estimation process.

**It is an automatic feature of the nonlinear programming problem computational software that all solutions are constrained to be nonnegative, unless explicitly programmed otherwise.

***Note that in this model a constant term ($Z_{0i} = 1.0$) is not used; i.e., it is postulated that zero problems would occur if the parameters were all zero. In other words, the regression plane is being forced to go through the origin.


Table 4-14. Parameter Definitions

ROUT             Coded routine identifier. Each routine has an identifier which replaces its real name. This identifier indicates the routine's parent subsystem and function as well: a subsystem letter (e.g., A), a function number, and a routine number (e.g., 01).

PROBS            Number of actual problems encountered in the routine. Actual problems are those that required an update to the routine's code.

$Z_1$ = TS       Total routine statements [TS = NEX + EX].

$Z_2$ = $L_{LOOP}$   Computed loop complexity* for the routine according to the following equation:

                 $$L_{LOOP} = \sum_{i=1}^{Q} m_i W_i$$

                 where the weighting factors are such that $\sum_i W_i = 1$, and:
                 $m_i$ = number of loops in routine at indenture or nesting level i
                 $W_i$ = weighting factor
                 Q = maximum level of indentures in the system
                 4 = shaping value

$Z_3$ = $L_{IF}$     Computed IF complexity* for the routine according to the following equation:

                 $$L_{IF} = \sum_{i=1}^{Q} n_i W_i$$

                 where:
                 $n_i$ = number of IFs in routine at indenture or nesting level i
                 $W_i$ = weighting factor (the same as for $L_{LOOP}$)

$Z_4$ = BR       Total routine branches.

$Z_5$ = LS       Routine logical statements (IF, ORIF, IFEITH).

$Z_6$ = AP       Direct routine interfaces with other applications routines (not a count of calls to other routines).

$Z_7$ = SYS      Direct routine interfaces with operating system or system support routines (not a count of calls to system routines).

$Z_8$ = I/O      Routine input/output statements.

$Z_9$ = COMP     Routine computational statements.

$Z_{10}$ = DAT   Routine data handling statements.

$Z_{11}$ = NEX   Routine nonexecutable statements.

$Z_{12}$ = EX    Routine executable statements.

$Z_{13}$ = TI    Total routine interfaces with other routines [TI = AP + SYS].

$Z_{14}$ = COM   Total routine comments. Comments are not included in the count of nonexecutable statements, NEX.

$Z_{15}$ = RAT   Average programmer rating. This parameter is an average based on the ratings of each programmer who worked on the routine.

$Z_{16}$ = WK-LD Average workload of programmers who worked on the routine.

*Values were multiplied by 1000 as a scaling factor.
Table 4-15. Project 3 Routine Parameters by Subsystem (Continued)


In Equation (1), $Z_{ji}$ is the value of $Z_j$ for the ith routine and $a_j$ is the "influence coefficient" for the jth parameter. The observed number of software problems for the ith routine is denoted by $Y_i$, and the methodology for estimating the $a_j$ is to minimize the sum of squares of the deviations of the observed values from their expected values, $\hat{Y}_i$, i.e., minimize

$$Q = \sum_{i=1}^{K} (Y_i - \hat{Y}_i)^2 \qquad (2)$$

or, substituting from (1),

$$Q = \sum_{i=1}^{K} \Bigl( Y_i - \sum_{j=1}^{N} a_j Z_{ji} \Bigr)^2 \qquad (3)$$

The "influence coefficients" $a_j$ are constrained to be nonnegative:

$$a_j \ge 0, \qquad j = 1, 2, \ldots, N \qquad (4)$$

Consequently the problem of evaluating the $a_j$ becomes a nonlinear programming problem with quadratic objective function Q given by (3) and inequality constraints (4).
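To make the optimization problem concrete, the following Python sketch minimizes a sum of squared deviations subject to nonnegative coefficients. It is not the study's solver (the report used general nonlinear programming programs); cyclic coordinate descent is swapped in here purely because it is short and self-contained. The function name, the iteration limits, and the three-routine data set are all hypothetical.

```python
def nnls_coordinate_descent(Z, y, sweeps=2000, tol=1e-12):
    """Minimize sum_i (y[i] - sum_j a[j] * Z[i][j])**2 subject to a[j] >= 0.

    Z[i][j] is the value of parameter j for routine i. Each coefficient in
    turn is set to its one-dimensional least-squares optimum, clipped at zero.
    Returns the coefficients and the minimized sum of squares.
    """
    k, n = len(Z), len(Z[0])
    a = [0.0] * n
    resid = list(y)  # residuals y - Z a, starting from a = 0
    col_sq = [sum(Z[i][j] ** 2 for i in range(k)) for j in range(n)]
    for _ in range(sweeps):
        biggest_step = 0.0
        for j in range(n):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            grad = sum(Z[i][j] * resid[i] for i in range(k))
            new_aj = max(0.0, a[j] + grad / col_sq[j])
            step = new_aj - a[j]
            if step != 0.0:
                for i in range(k):
                    resid[i] -= step * Z[i][j]
                a[j] = new_aj
                biggest_step = max(biggest_step, abs(step))
        if biggest_step < tol:
            break
    q = sum(r * r for r in resid)
    return a, q


# Hypothetical data: the unconstrained fit would give a = (1, -1).
Z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, -1.0, 0.0]
a, q_min = nnls_coordinate_descent(Z, y)
```

In this small example the unconstrained least-squares solution has a negative second coefficient, so the constraint becomes active and that coefficient is held at zero, which is the behavior the text describes for parameters that drop out of the predictor set.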

Although, as pointed out in Section 4.3.3.2, the well-established statistical methods of regression analysis do not apply when inequality constraints are applied to the estimators,* the usual standard error may still be calculated and used as a measure of the closeness of fit of Y to a linear function of the specified parameters.

The standard error of Y estimates the square root of the residual variance of Y, which is a measure of the greatest closeness of fit that may be obtained when representing Y as a linear function of the parameters $Z_1, \ldots, Z_N$. Its formula is

$$\hat{\sigma} = [Q_{min}/(K - N')]^{1/2} \qquad (5)$$

where $Q_{min}$ is the value of Q in (3) obtained by substituting in the values $\hat{a}_j$ which minimize Q, and N' is the number of nonzero $\hat{a}_j$.
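The standard error formula can be sketched as follows, assuming the fitted coefficients are already available. The function name and the sample data are hypothetical; the only substantive point illustrated is that the divisor uses the number of nonzero coefficients N' rather than the full parameter count N.

```python
import math


def standard_error(Z, y, a):
    """sigma_hat = sqrt(Q_min / (K - N')), with N' the number of nonzero coefficients."""
    k = len(y)
    q_min = sum(
        (yi - sum(aj * zij for aj, zij in zip(a, row))) ** 2
        for row, yi in zip(Z, y)
    )
    n_nonzero = sum(1 for aj in a if aj != 0.0)
    if k <= n_nonzero:
        raise ValueError("need more routines than nonzero coefficients")
    return math.sqrt(q_min / (k - n_nonzero))


# Hypothetical fit: three routines, two parameters, one coefficient held at zero.
Z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, -1.0, 0.0]
sigma_hat = standard_error(Z, y, a=[0.5, 0.0])
```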

When $Y_i$ is assumed to be a linear function of only one parameter, say $Z_k$, the solution is particularly simple, and we have

$$\hat{a}_k = \frac{\sum_{i=1}^{K} Y_i Z_{ki}}{\sum_{i=1}^{K} Z_{ki}^2} \qquad (6)$$

and

$$\hat{\sigma}_k = \Bigl[ \sum_{i=1}^{K} (Y_i - \hat{a}_k Z_{ki})^2 / (K-1) \Bigr]^{1/2} \qquad (7)$$

*The basic reason is that the joint distribution of the estimators has positive probability over regions which have (potentially) irregular boundaries, and consequently the analysis would become much more intricate.


In the single parameter case, the estimator $\hat{a}_k$ is guaranteed to be positive (negative) since each $Y_i$ is nonnegative and all $Z_{ki}$ are positive (negative)*, so that ordinary regression methods apply.
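The single-parameter case has a closed-form solution, which the following Python sketch computes for a line forced through the origin. The function name and the three data points are hypothetical.

```python
import math


def fit_through_origin(z, y):
    """Return (a_hat, sigma_hat) for the model y_i ~ a * z_i with no constant term."""
    k = len(y)
    a_hat = sum(zi * yi for zi, yi in zip(z, y)) / sum(zi * zi for zi in z)
    q = sum((yi - a_hat * zi) ** 2 for zi, yi in zip(z, y))
    sigma_hat = math.sqrt(q / (k - 1))  # one coefficient estimated, so K - 1
    return a_hat, sigma_hat


# Hypothetical data: parameter values and observed problem counts for 3 routines.
z = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 7.0]
a_hat, sigma_hat = fit_through_origin(z, y)
```

Since the slope is a ratio of a sum of products to a sum of squares, it cannot be negative when all observations and all parameter values are nonnegative, which is why no explicit constraint is needed in this case.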

The correlation coefficient $r_k$ will be used as a measure of association of Y and the single parameter $Z_k$. The value of $r_k^2$, sometimes called the "coefficient of determination," measures the fraction of the total variance of number of problems Y accounted for by the particular regression function shown. The correct formula for $r_k$ when the regression line is forced through the origin is

$$r_k = \hat{a}_k \Bigl[ \sum Z_{ki}^2 \Big/ \sum Y_i^2 \Bigr]^{1/2} \qquad (8)^{**}$$

(where all summations are from i = 1 to K). One must be careful in using other formulas for the correlation coefficient, as, for example,

$$r_k = \hat{a}_k \Bigl[ \frac{\sum Z_{ki}^2 - K \bar{Z}_k^2}{\sum Y_i^2 - K \bar{Y}^2} \Bigr]^{1/2} \qquad (9)$$

which is not valid when the regression line is forced through the origin.***
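The difference between a through-the-origin correlation and a mean-corrected one can be seen numerically. The Python sketch below is an illustration under assumptions: `r_through_origin` uses uncorrected sums of squares, and `r_centered` is the ordinary Pearson correlation, used here as a stand-in for the kind of mean-corrected formula the text warns against. Function names and data are hypothetical.

```python
import math


def r_through_origin(z, y):
    """a_hat * sqrt(sum z^2 / sum y^2), using uncorrected sums of squares."""
    sum_zz = sum(zi * zi for zi in z)
    a_hat = sum(zi * yi for zi, yi in zip(z, y)) / sum_zz
    return a_hat * math.sqrt(sum_zz / sum(yi * yi for yi in y))


def r_centered(z, y):
    """Ordinary mean-corrected (Pearson) correlation, appropriate with an intercept."""
    k = len(y)
    zbar, ybar = sum(z) / k, sum(y) / k
    szy = sum((zi - zbar) * (yi - ybar) for zi, yi in zip(z, y))
    szz = sum((zi - zbar) ** 2 for zi in z)
    syy = sum((yi - ybar) ** 2 for yi in y)
    return szy / math.sqrt(szz * syy)


z = [1.0, 2.0, 3.0]
y = [5.0, 5.0, 6.0]  # hypothetical counts with a large constant component
r0 = r_through_origin(z, y)
rc = r_centered(z, y)
```

With uncorrected sums, the squared coefficient equals one minus the ratio of the residual sum of squares to the sum of squared observations, so it measures variation about zero rather than about the mean; the two functions therefore generally disagree on the same data.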

During  the  course  of  the  analysis  the  parameter  subsets  for  the  linear 
regression  function  were  changed,  as  discussed  below. 

Initially all sixteen parameters were considered, even though some were expressible as linear combinations of, or were strongly related to, other parameters. For example, total number of statements equals the sum of the numbers of executable statements plus nonexecutable statements (TS = EX + NEX). Also, total number of interfaces equals the sum of the numbers of application program interfaces plus system program interfaces (TI = AP + SYS). Subsequently, a subset of ten parameters considered to be essentially unrelated was selected for re-analysis. These were $L_{LOOP}$, $L_{IF}$, BR, AP, SYS, I/O, COMP, DATA, NEX and COM. Although also independent, the programmer-related parameters RAT and WK-LD were deleted from consideration at this point, since values for these parameters were not available for all routines. Thus, following the analyses using the full set of parameters (Phase I) in which the influence coefficients were constrained to be nonnegative (nonpositive for COM and RAT), a standard linear regression analysis (Phase II) was performed on the independent parameters with no constraints on the influence coefficients; however, the regression plane was forced through the origin.

*The parameters COM and RAT are given minus signs as noted previously.

**Searle, S.R., Linear Models, John Wiley and Sons, Inc., New York, 1971, pp. 95-98.

***Bowker, A.H. and Lieberman, G.J., Engineering Statistics, Prentice-Hall, Inc., 1959, p. 274.

Subsidiary analyses were also performed. These included a linear regression analysis of number of problems as a function of any single parameter, in order to compare the benefits of using several parameters versus one parameter for predicting number of problems. Secondly, for those of the ten independent parameters which appeared to be significant (in having nonzero influence coefficients) the residuals, or differences between the observed number of problems and the number calculated from the regression function, were plotted in order to evaluate the chosen model or determine, if possible, changes to the model (e.g., quadratic instead of linear). Third, a scheme of smoothing out the single parameter linear regression by systematically eliminating "outlier points", or routines that showed excessive statistical deviations from the predicted number of problems, was applied; in nearly all cases an explanation for the anomalous behavior could be found based upon independent data, including an intimate knowledge of the routine and circumstances surrounding its development and test. Fourth, an analysis was performed of several types of software problems, each type being considered as a function of the parameter considered to be the best metric for the corresponding attribute (as labeled by type of problem). Thus, for example, numbers of interface problems were considered to be a linear function of total number of interfaces ($Z_{13}$); etc.

The subsequent sections present the data, analysis and discussion of the results, as well as suggested hypotheses to be tested with new data.

4.3.4  Description  of  Regression  Analyses  Results  (Phase  I) 

4.3.4.1 Data Grouped by Subsystem

Table 4-16 shows the computed influence coefficients and the standard error $\hat{\sigma}$, using the full set of parameters when the data are grouped by subsystem. As noted before, all sixteen (16) parameters (the independent variables) were used for evaluation of subsystems A, B, ..., F, but only fourteen (14) were available for subsystems G and H. Table 4-17 shows the influence coefficients and standard errors, $\hat{\sigma}$, for the cases when number of software problems $Y_i$ is proportional to only one parameter, for each parameter. Correlation coefficients (Eq. (8)) are shown only for those parameters whose influence coefficients are nonzero in Table 4-16.

In all cases, except for subsystem E, it can be noted that the standard error $\hat{\sigma}$ when the full set of parameters is considered (Table 4-16) is less than every standard error when one parameter alone is assumed to be the determiner of the number of software problems (Table 4-17). In other words, except for subsystem E data, in no case does a single parameter allow prediction of number of problems with more precision than does the combination of parameters for each software subsystem as given in Table 4-16.
In the case of subsystem E, the least squares solution is suspect since only 14 observations are available. For the same reason the solutions for subsystems D and B should also be regarded with caution. The values are given in Table 4-16, nevertheless, since the iteration procedure to solve the constrained least squares problem converged. Possibly the reason for convergence is that if only the parameters for which nonzero coefficients were obtained had been considered, the solutions for the coefficients would have been the same as given in Table 4-16. Thus nine coefficients would have been estimated from 14 observations, which is adequate. However, the fact that for subsystem E alone five of the 16 parameters showed single variable standard errors of prediction less than that of the best linear prediction considering all of the variables is sufficient justification to discard the solution for subsystem E.

It is apparent by inspection of Table 4-16 that different sets of parameters combine to give best linear predictors of numbers of software problems for the eight subsystems of Project 3. For example, the best combination of parameters for subsystem G consists of the three parameters $Z_4$, $Z_9$, and $-Z_{14}$ (number of branch statements, number of computational statements, and number of comments).

The numbers of errors for subsystems A, B and E are best predicted by linear functions with nine (9) parameters as shown in Table 4-16, with generally different sets of parameters for each subsystem.

On the other hand, the parameters $Z_1$ (Total Statements) and $Z_2$ (weighted loop nesting level parameters) are not contained in any of the sets of best predictors. In the case of $Z_1$, an explanation would be simply that the components making up the total number of statements, when separately weighted and then added together, constitute a better predictor than an unweighted sum of such components. This is not to say that the parameter $Z_1$ is a bad predictor, because, as shown in Table 4-17, the standard error of predicted versus observed errors when treating $Z_1$ as a single predictor is smaller than most of the standard errors associated with other variables.

A similar explanation for the parameter $Z_2$ not appearing in any set of best predictors in Table 4-16 is not available. Table 4-17 also indicates, by its associated standard errors being relatively large, that $Z_2$ simply may not be a good predictor for number of errors.

The parameters $Z_{11}$ (nonexecutable statements) and $Z_{12}$ (executable statements) each appear in only one set of predictors, possibly for the same reason as given for $Z_1$, and also, since $Z_{11}$ appears in the set of predictors for the (anomalous) subsystem E, this fact should be ignored.

At the other end of the spectrum, parameters $Z_6$ (application program interfaces) and $Z_{14}$ (comments) appear in all but one of the best predictor parameter sets, indicating a relatively higher predictive capability than other parameters. As single parameter predictors, however, Table 4-17 does not indicate that $Z_6$ or $Z_{14}$ have outstanding capabilities, their associated standard errors varying over a wide range of their rank order for each subsystem as discussed in the next paragraph.
From Table 4-17, the standard errors associated with each parameter may be ordered from the smallest to the largest, as shown in Table 4-18. As a tentative means of evaluating relative merit of single parameter predictors, if the lowest to the fourth lowest standard deviations are assigned scores of 4, 3, 2, 1 points respectively, then the three highest scoring parameters are $Z_{10}$, $Z_{12}$, and $Z_4$, with 13, 13, and 10 points, respectively. These scores are not significantly higher than those of some other parameters, however. On the other hand, if one had to choose a single predictor for number of errors, it would probably be $Z_{12}$ (executable statements) as being the simplest to evaluate. However, as discussed in the next-to-last paragraph, $Z_{12}$ appeared in only one set of predictors (Subsystem H, Table 4-16), possibly because of its being better represented by a weighted sum of other parameters.

Table 4-18

Subsystem   Parameters in Order of Standard Deviation of Prediction

A           12, 10, 4, 1, 5, 11, 13, 8, 3, 6, 14, 9, 7, 16, 15, 2
B           7, 13, 6, 15, 5, 10, 12, 16, 1, 9, 8, 11, 14, 4, 2, 3
C           10, 12, 1, 5, 11, 4, 9, 14, 13, 7, 8, 6, 2, 15, 16, 3
D           13, 5, 16, 7, 4, 6, 15, 11, 1, 12, 3, 10, 6, 9, 2, 14
E           4, 10, 11, 1, 12, 3, 5, 2, 8, 7, 9, 13, 14, 16, 6, 15
F           6, 10, 1, 12, 11, 14, 5, 4, 13, 9, 3, 15, 16, 8, 7, 2
G           4, 9, 14, 12, 5, 1, 11, 10, 3, 13, 2, 6, 7, 8
H           12, 5, 1, 11, 14, 9, 10, 4, 6, 13, 3, 7, 2, 8
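The 4-3-2-1 scoring scheme described before Table 4-18 can be checked with a short Python sketch. The four leading entries per subsystem are read from the table; the row labels B, C, D and H and the third entry for subsystem E are readings of poorly legible characters and should be treated as assumptions. The function name is hypothetical.

```python
# Four best single-parameter predictors per subsystem, read from Table 4-18.
TOP_FOUR = {
    "A": [12, 10, 4, 1],
    "B": [7, 13, 6, 15],
    "C": [10, 12, 1, 5],
    "D": [13, 5, 16, 7],
    "E": [4, 10, 11, 1],
    "F": [6, 10, 1, 12],
    "G": [4, 9, 14, 12],
    "H": [12, 5, 1, 11],
}


def rank_scores(top_four):
    """Award 4, 3, 2, 1 points to the lowest through fourth-lowest standard errors."""
    scores = {}
    for ranked in top_four.values():
        for points, param in zip((4, 3, 2, 1), ranked):
            scores[param] = scores.get(param, 0) + points
    return scores


scores = rank_scores(TOP_FOUR)
# Sort by descending score, breaking ties by parameter number.
best = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:3]
print(best)  # [(10, 13), (12, 13), (4, 10)]
```

The result agrees with the scores quoted in the text: 13 points each for parameters 10 and 12, and 10 points for parameter 4.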

Basis for Subsequent Analysis


The preceding discussion was based upon (1): consideration of sixteen measurable software parameters, and (2): the parameter data being grouped by subsystems. Since some of the sixteen (16) parameters were known to be functionally related to others, it was decided to select a subset of ten (10) parameters considered as unrelated, and use these as the basic data. Secondly, some additional meaningful groups of routines were defined by classifying them by principal function performed. These were Control (group label CON), Input Processing (INP), Output Processing (OUT), Primarily Computational (PC), Set up or initialization (SET), Utilities (UTL), and Post-Processing (PP). The purpose of defining these groupings is based upon the assumption that the same functions could be defined generally for other software systems, and that the routines belonging to them would show similar software problem activity.

An additional set of routine groupings was defined in terms of the original subsystems. These groups are labeled $A_1$, $A_2$, ..., $H_2$, where for example $A_1$ denotes the collection of routines labeled $A_{101}$, $A_{102}$, $A_{103}$, and $A_{104}$, as defined in Table 4-15. The parameters for the group labeled $A_1$ were derived by summing the values of the corresponding parameters for each of $A_{101}$, ..., $A_{104}$, and similarly for $A_2$, ..., $H_2$. This grouping is given the label "U", consisting of twenty-five (25) (super)routines or functions.
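The construction of the function-level "super routines" amounts to summing parameter values over routines that share a subsystem letter and function number. A minimal Python sketch follows; it assumes identifiers of the form letter, function digit, two-digit routine number (for example "A101"), and the function name and parameter values are hypothetical.

```python
from collections import defaultdict


def sum_by_function(routines):
    """Sum parameter values over routines sharing subsystem letter and function digit.

    `routines` maps an identifier such as "A101" to a dict of parameter values.
    The first two characters ("A1") are taken as the group label (assumption).
    """
    groups = defaultdict(lambda: defaultdict(float))
    for rout, params in routines.items():
        label = rout[:2]
        for name, value in params.items():
            groups[label][name] += value
    return {label: dict(vals) for label, vals in groups.items()}


# Hypothetical parameter values for three routines.
routines = {
    "A101": {"BR": 3, "COM": 10},
    "A102": {"BR": 2, "COM": 5},
    "A201": {"BR": 7, "COM": 1},
}
functions = sum_by_function(routines)
```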
It was also decided to perform a standard (no constraints) linear regression analysis after having deleted from consideration those parameters whose influence coefficients were found to be zero, for each subsystem and each functional grouping mentioned previously. The primary reasons for doing this were (1): to check whether the standard (unconstrained) linear regression analysis would yield regression coefficients for the unscreened parameters identical with those obtained using the constrained analysis; (2): for convenience, in that the standard linear regression computer programs compute correlation coefficients, as well as the multiple correlation coefficient (of number of problems with the linear regression function); and (3): to statistically determine the closeness of fit of the regression function to the number of software problems by testing the observed multiple correlation coefficient for significance.

4.3.5  Description  of  Regression  Analysis  Results  (Phase  II) 

4.3.5.1 Special Function Groupings

Table 4-19 lists the parameters for each of the function groupings, labeled $A_1$, $A_2$, ..., $H_2$. Programmer-related parameters $Z_{15}$, $Z_{16}$ are not included in any of the groupings, since these data were not available for groupings $G_1$, $G_2$, $H_1$, $H_2$. Table 4-20 gives the computed influence coefficients for the best linear predictor for numbers of errors and the standard error of prediction using the constrained least squares analysis.

Table  4-21  gives  the  labels  for  routines  adjudged  to  belong  to  selected 
groupings  CON,  INP,  OUT,  PC,  SET,  UTL,  and  PP.  The  parameter  data  are  as 
previously  given  in  Table  4-15.  Since  the  selected  group  PP  contains  only 
two  routines,  no  further  analysis  was  performed  for  this  particular  group. 

Table 4-22 presents the computed influence coefficients and standard errors for the selected groupings CON, INP, OUT, PC, SET and UTL, using the constrained least squares analysis with all sixteen parameters present (note that the above groupings did not include routines from subsystems G or H, for which only fourteen parameters were available). Group INP contains only eight (8) routines, which apparently did not lead to any difficulty in the constrained least squares prediction process, and resulted in six (6) nonzero coefficients*. Group SET contains only sixteen routines, and resulted in four (4) nonzero coefficients. Possibly the results for INP should be disregarded, but the remaining results in Table 4-22, including those for SET, appear satisfactory. However, no further analysis was performed subsequently for the grouping INP.


*Possibly as long as there are fewer nonzero coefficients than number of routines, a solution will be produced by the nonlinear programming program used.


Table 4-20. Influence Coefficients for Software Reliability Parameters Derived from Function Groupings "U"

Table 4-21. Selected Function Groupings by Routine Number (Project 3)
4.3.5.2 Reduction to Ten Unrelated Parameters

Some of the sixteen original parameters are functionally related, as mentioned previously. Consequently the ten parameters $L_{LOOP}$, $L_{IF}$, BR, AP, SYS, I/O, COMP, DATA, NEX, and COM were selected for further analysis. Another reason for selecting a nonfunctionally related subset of parameters at this point is that standard linear regression analyses (no constraints) will not work when two or more of the controllable or "independent" variables are linearly dependent.*
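The failure of unconstrained regression under linear dependence can be demonstrated with the relation TS = EX + NEX. The Python sketch below builds the matrix of sums of squares and cross-products for three columns (EX, NEX, TS) and shows that its determinant vanishes. The helper names and the statement counts are hypothetical.

```python
def gram(Z):
    """Matrix of sums of squares and cross-products of the columns of Z."""
    n = len(Z[0])
    return [[sum(row[p] * row[q] for row in Z) for q in range(n)] for p in range(n)]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


# Hypothetical (EX, NEX) statement counts for four routines.
rows = [(40, 10), (25, 5), (60, 20), (12, 3)]
Z = [(ex, nex, ex + nex) for ex, nex in rows]  # third column is TS = EX + NEX
singular_det = det3(gram(Z))
print(singular_det)  # 0
```

Integer arithmetic is used so the zero is exact rather than a rounding artifact. Dropping the TS column leaves a 2x2 matrix with a nonzero determinant, which is the effect achieved by reducing to unrelated parameters.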

4.3.5.3 Analysis Using the Ten Parameters (Reduced Set)

At this point the analysis of the Project 3 data was restarted using the ten (hypothesized) independent parameters, referred to as the "reduced" set of parameters. Table 4-23 presents the computed influence coefficients and the standard error using the reduced set of parameters for subsystems A - H, Functions U, and special groupings CON, ..., UTL, defined previously, and using the constrained least squares method. The values of the influence coefficients and standard errors can be compared directly with those of Tables 4-16, 4-19 and 4-22, respectively (the latter tables give the results of the constrained least squares regression analysis for the full set of parameters).

As can be seen by comparison of the results for the reduced set of parameters with those for the full set of parameters, the coefficients showed a similar behavior, being either simultaneously zero in nearly all cases, or nearly of the same magnitude when nonzero. For example, subsystem B shows a nonzero coefficient for $L_{IF}$ ($\hat{a}_3 = 0.426$) when the full set of parameters is considered, but $\hat{a}_3 = 0$ for the reduced set of parameters. Also for subsystems B and F, Functions U, and groupings C and SET, $\hat{a}_4$ shows the opposite behavior, being zero for the full set but positive for the reduced set. The latter is also true for $\hat{a}_{11}$ in subsystem H, and $\hat{a}_{10}$ in Group PC. These differences are generally not significant, however, since when $\hat{a}_4$ is positive it is still fairly small and its contribution to the total number of problems (when multiplied by the number of branches) is not significant. For $\hat{a}_3$, the value of the parameter $Z_3$ is often very small, so again $\hat{a}_3 Z_3$ does not significantly contribute to the total number of problems. Since $\hat{a}_{10}$ for the reduced set of parameters for PC is relatively small, the same statement can be made in that case. For subsystem H, the discrepancy between $\hat{a}_{11}$ for the reduced set and for the full set appears more significant, however. When the standard regression analysis is discussed, the statistical significances of the influence coefficients are included, which further aids in screening out the less important parameters from the prediction.


* The matrix of the controllable variables has a vanishing determinant and
  therefore cannot be inverted in attempting to solve for the coefficients.
4.3.5.4 Analysis of a Further Reduced Set of Parameters

A standard linear regression analysis was applied to a set of parameters
of each subsystem or grouping whose influence coefficients were not zero
from the previous analysis (these parameters can be directly read in
Table 4-23 as those with nonzero values of a_k). The analysis was
nonstandard in the following respect, however: the regression plane was
forced to go through the origin. The main reason for this choice of a
linear regression model without a constant term is to eliminate
nonstructural, or undefined, parameters from the model. This is based upon
perhaps optimistic assumptions that the linear regression model and the
selected parameters are adequate for predicting numbers of problems. To
some extent, as discussed in Section 4.3.5.6, this assumption has been
examined by plotting residuals (observed minus predicted numbers of
problems) against the essential parameters.
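A regression forced through the origin can be sketched in a few lines. The
Python below is an illustration only, not the program used in the study:
the function names and the four data points are hypothetical, and only two
predictors are handled so that the normal equations can be solved in closed
form.

```python
def fit_through_origin(z1, z2, y):
    """Least squares fit of y = a1*z1 + a2*z2 with no constant term."""
    s11 = sum(a * a for a in z1)
    s22 = sum(b * b for b in z2)
    s12 = sum(a * b for a, b in zip(z1, z2))
    s1y = sum(a * c for a, c in zip(z1, y))
    s2y = sum(b * c for b, c in zip(z2, y))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("predictors are linearly dependent")
    a1 = (s1y * s22 - s2y * s12) / det
    a2 = (s2y * s11 - s1y * s12) / det
    return a1, a2


def residuals(z1, z2, y, a1, a2):
    """Observed minus predicted values."""
    return [obs - (a1 * p + a2 * q) for p, q, obs in zip(z1, z2, y)]


# Hypothetical routines: branches, interfaces, and problem counts built
# from y = 0.05*branches + 0.2*interfaces so the fit is exact.
branches = [100, 250, 400, 80]
interfaces = [10, 5, 30, 12]
problems = [7.0, 13.5, 26.0, 6.4]

a1, a2 = fit_through_origin(branches, interfaces, problems)
res = residuals(branches, interfaces, problems, a1, a2)
print(a1, a2, res)
```

Because no constant term is fitted, a routine with zero branches and zero
interfaces is predicted to have zero problems, which is the assumption the
text describes.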

Table 4-24 summarizes the results of the standard regression analysis.
Presented in the table is the standard error of the regression function for
the "further" reduced set of parameters for each grouping of routines
A, ..., H, U, and CON, ..., UTL, defined previously. Since the standard
errors have the dimension "number of problems", they should not be compared
for different groupings of routines, but only for different choices of the
set of parameters used as predictors for a given grouping. The multiple
correlation coefficient, r, is then given and in all cases was found highly
significant; i.e., nonzero with high probability, based upon the usual
normal distribution assumptions. The values of r range from slightly below
0.9 to 0.983, indicating that from about 80 to 96 percent (100 r^2 percent)
of the variance of number of problems is accounted for by the selected
linear regression function.
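The conversion from r to percent of variance accounted for is simple
arithmetic. As a worked check (the function name is hypothetical):

```python
def percent_variance_explained(r):
    """Return 100 * r^2 for a multiple correlation coefficient r."""
    return 100.0 * r * r


low = round(percent_variance_explained(0.9), 1)
high = round(percent_variance_explained(0.983), 1)
print(low, high)
```

A value of r slightly below 0.9 therefore corresponds to roughly 80 percent,
and r = 0.983 to roughly 96 to 97 percent.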

4.3.5.5 Development of a Minimized Set of Parameters

The particular technique used for standard linear regression analysis
also attempts to find a minimized set of parameters such that the multiple
correlation coefficient (standard error) is not reduced (increased)
significantly. This minimization process results in the "minimized set of
parameters" whose subscripts are shown in Table 4-24, as well as the
regression coefficients a_k for each minimized set for each grouping of
routines, the new (slightly increased) standard error, and the new
(slightly reduced) multiple correlation coefficient.

In general, the analysis results showed that the alleged independent
parameters were essentially independent, their correlation coefficients
being 0.75 or less. However, parameters Z4 (number of branches) and Z10
(number of data handling statements), when in the analyzed set, almost
consistently showed a high correlation coefficient (0.92 to 0.99). This is
also reflected in the process to select a minimized set of parameters, in
that in no case do parameters Z4 and Z10 occur together in a minimized set
(refer to Table 4-24). Another way of interpreting these observations is
that if one of the parameters Z4 or Z10 is present in a set of predictors
for numbers of problems, then the other will add no new information, or in
other words will not reduce the standard error of regression, if included;
in fact, in some cases, the inclusion of both Z4 and Z10 could even lower
the prediction capability of the regression function, as compared to each
parameter being used separately. Further, the high correlation will also
affect the appearance of residual plots, as noted in Section 4.3.5.6.
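The effect of two highly correlated predictors can be examined with a small
sketch. The data below are hypothetical (they are not Project 3 values),
and dividing the residual sum of squares by n minus the number of fitted
coefficients is an assumed definition of the standard error of regression.

```python
import math


def pearson(u, v):
    """Ordinary product-moment correlation coefficient."""
    n = len(u)
    um, vm = sum(u) / n, sum(v) / n
    suv = sum((a - um) * (b - vm) for a, b in zip(u, v))
    suu = sum((a - um) ** 2 for a in u)
    svv = sum((b - vm) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def sse_one(z, y):
    """Residual sum of squares for y = a*z (through the origin)."""
    slope = sum(a * b for a, b in zip(z, y)) / sum(a * a for a in z)
    return sum((b - slope * a) ** 2 for a, b in zip(z, y))


def sse_two(z1, z2, y):
    """Residual sum of squares for y = a1*z1 + a2*z2 (through the origin)."""
    s11 = sum(a * a for a in z1)
    s22 = sum(b * b for b in z2)
    s12 = sum(a * b for a, b in zip(z1, z2))
    s1y = sum(a * c for a, c in zip(z1, y))
    s2y = sum(b * c for b, c in zip(z2, y))
    det = s11 * s22 - s12 * s12
    a1 = (s1y * s22 - s2y * s12) / det
    a2 = (s2y * s11 - s1y * s12) / det
    return sum((c - a1 * a - a2 * b) ** 2 for a, b, c in zip(z1, z2, y))


def std_error(sse, n, k):
    """Standard error of regression with k fitted coefficients (assumed)."""
    return math.sqrt(sse / (n - k))


# Hypothetical routines: z10 is roughly half of z4, so the two are
# nearly collinear.
z4 = [120, 300, 450, 80, 610, 220]
z10 = [65, 148, 230, 38, 300, 115]
y = [7, 15, 22, 3, 29, 12]

r = pearson(z4, z10)
sse_z4 = sse_one(z4, y)
sse_both = sse_two(z4, z10, y)
print(r, std_error(sse_z4, len(y), 1), std_error(sse_both, len(y), 2))
```

Adding the second predictor can never raise the residual sum of squares,
but it costs one degree of freedom, so the standard error need not improve
and may get worse when the added predictor carries little new information.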
4.3.5.6 Residuals Analysis

The analysis used also tests the residuals (observed minus predicted
number of problems) for normality, which, if not rejected, justifies the
significance tests on the correlations obtained in the analysis. On the
other hand, were the normality hypothesis to be rejected, it could indicate
that the linear model was in some sense not suitable. In all cases, except
for subsystem A, the hypothesis of normality was not rejected. Figures 4-26
through 4-29 show plots of residuals for subsystem A versus Z4 (number of
branches), Z6 (number of application program interfaces), Z8 (number of
I/O statements), and Z10 (number of data handling statements), respectively.
In general, the residuals appear to increase in absolute value as the
parameter increases (fan out from the origin), implying that the variance of
number of problems increases with the parameter, rather than remaining
constant (as is assumed in standard regression analysis). On the other hand,
as will be shown later, nearly all of the high positive residuals (observed
minus predicted) represent "outliers", in that the corresponding routine
exhibits anomalous statistical behavior for many of the parameters including
these particular ones. However, the large negative residuals, in general,
are not outliers. Consequently, if departure from a constant is assumed,
the residuals may be showing a decrease to negative values for increasing
values of the parameter, indicative of high correlation between two or
more of the parameters.*

These conclusions are tentative, however, and are based upon a very limited
examination of residual plots. Due to lack of time and resources, this
part of the investigation could not be pursued further.
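One crude way to check for the "fan out" described above is to compare the
average size of the residuals in the lower and upper halves of the
parameter range. This is only a sketch under stated assumptions: the
function name and the residual values are hypothetical, and the study
itself relied on visual inspection of the plots.

```python
def spread_by_half(z, resid):
    """Mean absolute residual for the lower and upper halves of z."""
    pairs = sorted(zip(z, resid))
    half = len(pairs) // 2
    low = [abs(r) for _, r in pairs[:half]]
    high = [abs(r) for _, r in pairs[half:]]
    return sum(low) / len(low), sum(high) / len(high)


# Hypothetical parameter values and residuals (observed minus predicted).
z = [10, 20, 40, 80, 160, 320]
resid = [1, -1, 2, -4, 8, -9]

low_spread, high_spread = spread_by_half(z, resid)
print(low_spread, high_spread)
```

A markedly larger value for the upper half suggests that the variance grows
with the parameter, contrary to the constant-variance assumption.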

The results of the standard linear regression analysis can be summarized
briefly by stating that a certain small number of parameters may be used to
effectively predict numbers of software problems occurring in development.
These parameters are Z4 (number of branches); Z6 (number of application
program interfaces); Z9 (number of computational statements); and Z10
(number of data handling statements), although since Z4 and Z10 nearly
always are highly correlated, only one of them should be used.

4.3.5.7 Evaluation of Internal and External Complexity Metrics

Since both Z4 and Z6 appear to be good predictors jointly, it was
decided to determine their overall performance by constructing a linear
regression analysis with just these two parameters, evaluated for
each subsystem A, ..., H. Furthermore, it is intriguing to hypothesize that
total numbers of problems can best be predicted (or explained) by a
combination of routine-internal complexity, as measured by number of
branches, and routine-external complexity, as measured by number of
application program (other routine) interfaces with the given routine.

Table 4-25 presents the numbers of problems for each subsystem, Y, total
number of branches, Z4, and total number of application program interfaces,
Z6, each obtained by summing these quantities over all of the routines in
the subsystem. Table 4-26 summarizes the regression coefficients, their
standard deviations, the multiple correlation coefficient and the standard
error of regression.


* Private communication by N. R. Garner of TRW.
Figure 4-29. Residuals vs.
Table 4-25. Summed Data for Project 3 Subsystems

Subsystem       Y        Z4       Z6

    A          267      5814      238
    B          200      2330      168
    C          488      7823      287
    D          100      2926      117
    E          144      2618      101
    F           80      3054       81
    G          260      4550      330
    H          467      7434      296

Table 4-26. Statistical Data Summary for Parameters Z4 and Z6

                                                      Multiple
             Regression      Standard     Standard    Correlation
Parameter    Coefficient     Deviation    Error       Coefficient

   Z4        a4 = 0.0454     s4 = 0.0144
                                            54.4         0.982
   Z6        a6 = 0.254      s6 = 0.315

It will be noted that the standard deviation s6 = 0.315 is large compared
to a6, and in fact the analysis indicates that a6 is not significantly
different from zero. Consequently, the results of the preceding analysis
do not show that both internal and external interfaces are important in
predicting or determining numbers of problems. Actually, the minimization
feature of the regression analysis computer program used eliminates Z6,
resulting in a negligible increase in the standard error to 56.6 and a
decrease in r to 0.981. The overall "grand-average" regression or influence
coefficient of Z4 becomes a4' = 0.05665.
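The two-parameter fit can be approximately retraced from the Table 4-25
values. The sketch below makes several assumptions that should be kept in
mind: the regression is taken through the origin (as in the earlier
analyses), the data are used as legible in the printed table, and the
seventh Z6 entry is taken as 330, the value implied by the stated mean of
202.25. The results come out close to, but not identical with, the
coefficients in Table 4-26, which is consistent with a small legibility
error somewhere in the printed Z4 column.

```python
# Table 4-25 values as printed (Y = problems, Z4 = branches,
# Z6 = application program interfaces). The seventh Z6 value is inferred.
Y = [267, 200, 488, 100, 144, 80, 260, 467]
Z4 = [5814, 2330, 7823, 2926, 2618, 3054, 4550, 7434]
Z6 = [238, 168, 287, 117, 101, 81, 330, 296]


def fit_two_through_origin(z1, z2, y):
    """Least squares fit of y = a1*z1 + a2*z2 with no constant term."""
    s11 = sum(a * a for a in z1)
    s22 = sum(b * b for b in z2)
    s12 = sum(a * b for a, b in zip(z1, z2))
    s1y = sum(a * c for a, c in zip(z1, y))
    s2y = sum(b * c for b, c in zip(z2, y))
    det = s11 * s22 - s12 * s12
    return (s1y * s22 - s2y * s12) / det, (s2y * s11 - s1y * s12) / det


a4, a6 = fit_two_through_origin(Z4, Z6, Y)

# Single-parameter fit through the origin using Z4 alone.
a4_alone = sum(a * b for a, b in zip(Z4, Y)) / sum(a * a for a in Z4)

print(round(a4, 4), round(a6, 3), round(a4_alone, 4))
```

The fitted a4 lands near 0.045 and the Z4-only coefficient near 0.057, in
line with the reported 0.0454 and 0.05665; the fitted a6 is somewhat above
the reported 0.254, well within its large standard deviation.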

Nevertheless, if both Z4 and Z6 are a priori hypothesized important,
their relative influence can be estimated by calculating the numbers
of problems contributed by each in the two-variable linear regression.
Thus if N4, N6 are numbers of problems contributed by each parameter,
respectively, then

     N4 = a4 Z̄4 = (0.0454)(4573.625) = 208

     N6 = a6 Z̄6 = (0.254)(202.25) = 51
Table 4-27. Analysis of Project 3 Outlier Routines

Column headings: Routine; Parameter Group Where Outlier Appeared*
(Weighted No. of Branches; Application, System, I/O and Comp. Stmts);
Difficulty Ratings (Design, Code, Implementation, Checkout,
Documentation); Summed Difficulty; Average Difficulty of Subsystem
Outliers; Factors Contributing to Difficulty Ratings.

Factors Contributing to Difficulty Ratings, in table order:

     Complex logic
     Interface with routine A303
     Interface with routine A303
     Changing requirements, complex logic
     Late design reviews
     Interfacing software not available (late)
     Poor requirements, core load problems
     Changing requirements, complex logic and interfaces, core load problems
     Complex logic, core load problems
     Changing requirements, complex interfaces
     Complex interfaces
     Complex logic
     Changing requirements
     Easily modularized computational code
     Complex logic, multiple designers
     Changing requirements, complex man/machine interface
     Complex interfaces
     Changing requirements, complex interfaces
     Complex logic and interfaces
     Changing requirements, complex logic and interfaces
     Changing requirements, complex logic and interfaces
     Complex logic and interfaces
     Complex logic and interfaces

Summed Difficulty, legible entries in table order:
     7, 15, 11, 10, 12, 13, 12, 14, 11, 11, 15, 15, 13, 11

Average Difficulty of Subsystem Outliers, legible entries in table order:
     13.0, 13.2, 13.5, 11.0, 13.0, 13.5

Average Difficulty of All Outliers: 12.8

* Each entry in this part of the table represents the routine group in
  which the outlier routine was found.
where Z̄4, Z̄6 are the respective means of the Z4, Z6 given in Table 4-25.

These results would indicate that Z4 (number of branches) contributes about
four times as many software problems as does Z6 (number of application
program interfaces), a ratio of about 4:1.
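The arithmetic behind the 4:1 figure can be checked directly from the
reported coefficients and means (variable names here are illustrative):

```python
# Reported regression coefficients (Table 4-26) and parameter means.
a4, a6 = 0.0454, 0.254
z4_mean, z6_mean = 4573.625, 202.25

n4 = a4 * z4_mean   # problems attributed to branches
n6 = a6 * z6_mean   # problems attributed to application program interfaces
ratio = n4 / n6

print(round(n4), round(n6), round(ratio, 2))
```

The two contributions round to 208 and 51, giving a ratio slightly above 4.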

4.3.5.8 Single Variable Linear Regression Analysis

Although number of problems can nearly always be predicted with smaller
standard error when two or more parameters are allowed in the regression
function, as compared with using a single parameter, the single parameter
regression analyses were repeated (forcing the regression line through the
origin, with some exceptions) with two additional purposes. The first was
to attempt to remove "outliers", or routines in which the parameter data
were anomalous statistically. Subsequently, having identified the outlier
routines in this manner, the background data on each routine were
reexamined, and in nearly all cases such observations as "complex logic",
"changing requirements", "schedule problems", could be correlated with the
outlier routine.

Secondly, assuming normality and constant variance (of number of problems
over the full range of the parameter), 90 percent prediction limits
for the number of software problems were calculated. The data points,
regression line and 90 percent prediction limits are shown plotted in
Figure 4-30 for a selected set of parameters. The plotted regression line
and prediction limits are calculated from the smoothed data, obtained by
eliminating the outliers, which are also shown on the plots. The criterion
for rejection of outliers was chosen as ±3 times the standard error of
regression for the original data; the criterion was then applied,
repeatedly if necessary, to the data without the outliers until no more
outliers occurred.
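The repeated rejection procedure can be sketched as follows. This is an
illustration under stated assumptions rather than the study's program: the
data are hypothetical, the line is forced through the origin, and the
standard error is computed with n - 1 degrees of freedom (one fitted
coefficient), which the report does not specify.

```python
import math


def slope_through_origin(z, y):
    """Least squares slope for y = a*z."""
    return sum(a * b for a, b in zip(z, y)) / sum(a * a for a in z)


def standard_error(z, y, slope):
    """Standard error of regression, assuming n - 1 degrees of freedom."""
    sse = sum((b - slope * a) ** 2 for a, b in zip(z, y))
    return math.sqrt(sse / (len(z) - 1))


def reject_outliers(z, y, k=3.0):
    """Refit repeatedly, dropping points beyond k standard errors."""
    kept = list(zip(z, y))
    rejected = []
    while len(kept) > 2:
        zs, ys = zip(*kept)
        slope = slope_through_origin(zs, ys)
        limit = k * standard_error(zs, ys, slope)
        flagged = [(a, b) for a, b in kept if abs(b - slope * a) > limit]
        if not flagged:
            break
        rejected.extend(flagged)
        kept = [p for p in kept if p not in flagged]
    zs, ys = zip(*kept)
    return slope_through_origin(zs, ys), kept, rejected


# Hypothetical routines: problems are about 0.05 per branch with small
# alternating deviations, plus one anomalous routine at 800 branches.
branches = [100 * (i + 1) for i in range(15)]
problems = [b // 20 + (1 if i % 2 == 0 else -1) for i, b in enumerate(branches)]
problems[7] = 100

slope, kept, rejected = reject_outliers(branches, problems)
print(slope, rejected)
```

With few data points a single anomalous routine inflates the standard error
so much that it can escape a ±3 criterion, which may be one reason ±2
standard errors was used in a few cases.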

Table 4-27 summarizes survey information obtained previously on relative
"difficulties" of the outlier routines shown in the selected plots, in
addition to those routines identified as outliers for the remainder of the
reduced set of parameters, as well as some additional parameters for which
the smoothing analysis was accomplished. In a very few of these latter
cases ±2 standard errors was used as an outlier rejection criterion.

Section 2.0 describes the manner in which the "difficulty" data were
obtained. As Table 4-27 shows, in comparison to all routines the outlier
routines had been assessed as having a significantly higher average
difficulty measure (Table 4-28). In only one case (D104) was an outlier
routine identified which had a significantly lower number of problems
relative to its number of branches, all others having significantly high
numbers of problems. Coincidentally, the comment by the developer on that
particular routine was "...easily modularized computational code".

4.3.5.9 Analysis of Parameter Categories

Each parameter represents a metric which serves to measure a corresponding
attribute, as indicated by the label. In order to correlate the metric
with the attribute it purports to measure, software problems of Project 3
were grouped into "Interface," "Data Handling," "Computational" and
"Logical" categories denoted by Y_I, Y_D, Y_C, and Y_L, respectively. The
analysis in this section examines the relationship between number of
interface problems, Y_I, and number of interfaces, as measured by Z13
(total number of system and application program interfaces); similarly the
relationships Y_D versus Z10, Y_C versus Z9 and Y_L versus Z5, the numbers
of problems of each category with the corresponding parameters.

Figure 4-30. Software Problems Vs. Number of Branches
(Sheet 4 of 10)

Figure 4-30. Software Problems Vs. Number of Branches
(Sheet 6 of 10)

Figure 4-30. Software Problems Vs. Number of Application Program Interfaces
(Sheet 7 of 10)

To do this a standard regression analysis (with intercept) was performed
in each case using the values of Y, Z for each of the 25 functions defined
previously: A1, A2, ..., H2.

The sheets of Figure 4-31 show, by the fairly high positive correlation
coefficients for Data Handling problems (Y_D) and Logic problems (Y_L),
that the parameters Z10 and Z5 in fact provide good predictors of Y_D and
Y_L, respectively. The remaining two parameters Z13 and Z9 are only "fair"
predictors for Y_I and Y_C, respectively.
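For reference, a standard single-variable regression with intercept and its
correlation coefficient can be computed as below. The function name and the
four data points are hypothetical; they are chosen to lie exactly on a line
so the result is easy to verify by hand.

```python
import math


def linear_fit(z, y):
    """Return (intercept, slope, r) for y = intercept + slope*z."""
    n = len(z)
    zm, ym = sum(z) / n, sum(y) / n
    szz = sum((a - zm) ** 2 for a in z)
    syy = sum((b - ym) ** 2 for b in y)
    szy = sum((a - zm) * (b - ym) for a, b in zip(z, y))
    slope = szy / szz
    intercept = ym - slope * zm
    r = szy / math.sqrt(szz * syy)
    return intercept, slope, r


# Hypothetical functions: statements of one category and problems of the
# matching category, following y = 1 + 0.2*z.
statements = [10, 20, 30, 40]
category_problems = [3, 5, 7, 9]

intercept, slope, r = linear_fit(statements, category_problems)
print(intercept, slope, r)
```

In the study's terms, a value of r near 1 would mark the parameter as a
good predictor of its problem category, and a markedly lower value as only
a fair one.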
Figure 4-31. Project 3 Computational Problems versus Computational
Statements (Function Level) (Sheet 2 of 4)

Figure 4-31. Project 3 Interface Problems versus Interfaces (Function
Level) (Sheet 3 of 4)

Figure 4-31. Project 3 Logic Problems versus Logical Statements (Function
Level)


4.4 Analysis of Error Causes and Symptoms


An effort was made to develop a means of independently classifying
both causes and symptoms of errors using Project 5 data. Section 3.5
presents a discussion of the categories created in the attempt to define
specific causes of software errors. The list of symptomatic categories
shown in Table 4-29 was created in order to provide a basis for deriving
general relationships between software error causes and symptoms in other
software projects. The list of causative categories is contained in
Table 3-2 (Section 3.5).

A total of 365 Project 5 problem reports were examined and their
symptoms categorized. A two-way table of frequencies for causes and
symptoms was then produced for the 25 symptoms and 93 causes. Owing to
the relatively few entries in each of the 93 causes, or detailed error
categories, it was decided to recast the frequencies into major causative
error category groups, A, B, ..., K, X, by totalling all frequencies
within a group. Table 4-30 presents the reduced frequency data with the
25 symptom categories and 12 major causative categories. Some quantitative
relationships are apparent, but in general very few meaningful
inferences suggest themselves. The following appear to be worthwhile
observations:

(1)  The  most  frequently  appearing  symptoms  are  shown  below,  together 
with  the  percent  of  total  problems  (365)  exhibiting  the  given 
symptoms. 


     Symptom      No. of        Percent of Total
     No.          Problems      Frequency

      2            111            30.4
     15             60            16.4
     12             53            14.5
     11             33             9.0
     18             18             4.9
      7             16             4.4

     Total         291            79.6%

Thus 291/365 = 79.6%, or nearly 80%, of the problems have the
listed symptoms, with frequencies in the order given.

(2)  For  each  of  the  higher  frequency  symptoms  given  in  the  above 
table,  the  causes  may  also  be  tabulated  in  order  of  frequency, 
starting  with  the  highest  frequency  cause. 
     Symptom      Causes, in decreasing order of frequency for
     No.          given symptom

      2           B (30%), A (26%), D (12%), H (9%)
     15           J (65%), I (12%), X (12%)
     12           H (53%), G (23%)
     11           F (21%), C (15%), X (15%), H (12%), K (12%)
     18           J (56%), B (33%)
      7           F (31%), X (19%), B (12%), E (12%)

Thus, given that Symptom 2 (e.g., answer out-of-tolerance)
occurs, the probabilities that cause B or cause A is the
source of the error would be estimated as 30% and 26%,
respectively. In other words, the conditional probability
that causes A or B (computational or logical errors,
respectively) were operative is estimated as 56%.

The total frequencies for each cause do not show one or more
predominant causes, but rather that causes J, H, B occurred most
frequently, and nearly equally, and that the remainder of the causes
taper slowly down in frequency, in the order A, X, D, F, I, G, C, E, K.
Since causative category J is somewhat of a "catch-all," causes H
(Data base errors) and B (Logic errors) represent the most frequent
specific causes.

It  can  perhaps  be  concluded  from  the  cursory  examination  of  the 
Project  5 software  problem  data  described  above  that  little  is  to  be 
gained  by  attempting  to  define  and  associate  causes  and  symptoms  of 
software  errors,  at  least  at  present.  This  is  only  one  aspect  of  the 
problem  of  diagnosing  and  correcting  errors  based  upon  the  observed 
symptoms. 

From the viewpoint of using this type of information as a diagnostic/
maintenance aid to uncover software error causes, it appears that
considerable refinement in both the cause and symptom categories is
needed. To be useful in the detection of causative errors (i.e.,
debugging) the combinations of symptoms would need to include not only
the symptom description, Table 4-29, but also certain auxiliary data
such as a branch execution frequency table, a list of set/use
discrepancies, and other dynamically obtained data. To obtain these
additional data at the time of the software failure requires that the
operational software be instrumented and that necessary software tools
reside in core during execution.
Table 4-29. Symptomatic Categories


DESCRIPTION 

Void 

Data  Overflow 

Incorrect  Processing  (i.e.,  wrong  answer,  answer 
or  processing  action  not  as  expected) 

Abort 

Premature  Program  Exit 
Loop 

Too  Much  Output  Produced 

Routine - Routine Incompatibility

Maximum  Time  Allotment  Exceeded 

Incorrect  Initialization  Processing 

Incomplete  Processing 

Routine  - Routine  Data  Incompatibility 

Routine  - Data  Base  Incompatibility 

Run  Time  Too  Long 

Valid  Data  Destroyed 

Change  Control  Activity 

Standards  Violation 

Routine  Lacking  Required  Capability 

Routine  Lacking  Capability  (Enhancement) 

Inaccurate  Processing 

No  Output 

Software  - Hardware  Incompatibility 

Valid  Data  Not  Used 

Core  Storage  Exceeded 

Software  - Documentation  Incompatibility 


Table 4-30. Project 5 Software Problem Frequencies by Major Causative and Symptomatic Categories


4.5  Ancillary  Investigations 

In the course of examining and trying to understand software error
data documented during testing, it quickly became obvious that the picture
of causes is much more complex than had even been anticipated prior to the
beginning of the study. During the data collection process, it also became
obvious that there was a tremendous amount of information available, not
all of it test related, that might be useful in understanding the software
development cycle as a whole, as well as the test phase.

In the sections which follow are presented results of several ancillary
investigations of Project 3 data which were found to be interesting.
Not all investigations attempted are presented here because of
inconclusiveness and, in some cases, lack of confidence in the quality or
quantity of data. It will be noted that some of the results presented here
are inconclusive, too; similar investigations with other data sets may be
more successful. In each investigation, an attempt is made, where possible,
to identify preconceived notions and expected results (hypotheses?) held by
the investigators. An attempt is also made to outline conditions and
qualifiers needed to understand results.

4.5.1  Personnel  Ratings 

The objective of this investigation was to determine if personnel rating
parameters could be used to explain error rates in the software. The
feeling was that programmer experience*, alone, is a poor indicator of
quality or error-freeness of the software. As was mentioned in Section 2.0,
it was further felt that addition of a programmer load factor would improve
the correlation between error rates and personnel ratings.

In this investigation only TRW developers were considered. These
individuals (76 in this instance) were expected to design, code, debug,
document, test (routine level), and maintain TRW's portion of the software
system. Programmer ratings, a function of knowledge, intelligence,
initiative, and responsibility, were assigned by managers. A workload
factor was also determined.

The  contention  that  experience  alone  is  a poor  indicator  was  suggested 
in  a previous  study  [5]  and  this  result  was  borne  out  again  in  this  study. 

The contention that programmer rating and load factor would improve
the correlation did not hold. There was no correlation (r ≈ 0) between
programmer rating and error rates, in errors/100 statements. One very
interesting fact was discovered, however. In plotting each programmer's
error rate against his rating, it was noted that one group of programmers
had both high ratings (>16) and high error rates (arbitrarily set at
>2.0 errors/100 statements). Further examination showed that these
programmers were mostly technical work unit** managers who had, for one
reason or another, retained some of the software as their own
responsibility. This responsibility, added to the responsibilities of
managing from five to fifteen people, attending technical and management
meetings, finding and allocating needed resources, and generally serving as
an interface between the programmers and upper management and the customer,
detracted from their ability to produce error-free code. Figure 4-32
presents error rates vs. programmer rating for both the programmers and
work unit managers. Typically, these work unit managers are successful
programmers who are rewarded with a management position. Dividing the
error rate of each programmer by his load factor helped to bring these
managers into the range of all other programmers (Figure 4-33). The
percent difference between programmer and work unit manager error rates
changes from 34 percent to 12 percent using this technique.

 * In the Project 3 environment, at least.

** A work unit should not be confused with a chief programmer team.
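The load-factor adjustment amounts to a single division. In the sketch
below the function name and all of the numbers are hypothetical, and a load
factor above 1.0 is assumed to mean a heavier than normal workload.

```python
def load_adjusted_rate(errors, statements, load_factor):
    """Errors per 100 statements, divided by the workload factor."""
    rate_per_100 = 100.0 * errors / statements
    return rate_per_100 / load_factor


# Hypothetical programmer with a normal load and a hypothetical work unit
# manager carrying twice the load.
programmer = load_adjusted_rate(errors=12, statements=800, load_factor=1.0)
manager = load_adjusted_rate(errors=10, statements=400, load_factor=2.0)

print(programmer, manager)
```

Before adjustment the hypothetical manager's rate is 2.5 errors per 100
statements against the programmer's 1.5; after adjustment it is 1.25, which
illustrates how the division pulls heavily loaded individuals back toward
the range of the others.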

The message here may be to let these technical managers manage and
keep them away from the code. This may be easier said than done in real
world software development situations where time and manpower resources,
as well as tools to assist with tedious tasks, are in short supply.

As far as programmer ratings go, those used in this study were
clearly lacking in their ability to explain error rates. In one specific
instance where a highly rated programmer, not a work unit manager, had a
high error rate, it was discovered that this individual was going through
the throes of divorce at the time, and this fact, unknown by anyone at
the time, affected the individual's performance. Any successful programmer
rating scheme is going to have to take into account some of the less
tangible characteristics.

4.5.2  Errors  and  Programmer  Assignments 

The objective of this investigation was to determine if the number of
programmers assigned to work on a particular routine had any effect on the
error rate (actual errors per 100 source statements) for the routine. The
hypothesis in this investigation was that routines suffer higher error
rates as a result of having more than one programmer assigned the
responsibility of producing the routine.

Conditions and qualifications which apply in this investigation are
as follows:

• Only TRW routines were examined due to the lack of
co-contractor personnel data for Project 3.

• Multiple assignments occurred throughout the development
cycle. That is, design, coding, and testing
were all subject to multiple assignments of developers.

• Detailed assignments of percent work completed by each
programmer for each task were not available. Nor was it
possible to determine the type of multiple assignment
very accurately, i.e., whether it was in parallel or
serial in nature (or some combination of each).
Figure 4-32. Programmer Rating versus Rate per 100 Statements


Figure 4-33. Programmer Rating versus Error Rate Load Factor
Results, bearing in mind the above caveats, did not support the
contention that "too many cooks spoil the stew." In fact, routines with
only one programmer assigned had a slightly higher average error rate,
2.47/100 statements, than routines with two and three or more programmers
assigned, 1.87 and 1.97 errors/100 statements, respectively (see
Figure 4-34). Routines with high error rates (>2.0 errors/100 statements)
were typically subject to so many adverse development conditions that it
would be difficult to determine just how much influence personnel
assignments had. As one Project 3 manager pointed out, assignment of
multiple developers may have had a beneficial effect because these
assignments were typically made in times of crisis when the fresh viewpoint
and added manpower were needed.

Note in Figure 4-35 that multiple programmer assignments were made on
the larger routines. An average routine size of 574 statements characterized
the routines with three or more programmers assigned, while the average
routine sizes for two and one programmer assignments were 430 and 4*9
statements, respectively.

As part of the investigation of programmer assignment, there was an
effort to determine if the highly rated programmers were assigned to the
larger, and generally more complex, routines. Table 4-31 summarizes the
finding that there was a tendency for this to be true, with a 13.4% difference
between maximum and minimum average programmer ratings. Perhaps the
most interesting finding in this investigation, however, was the fact that
routines that experienced no problems (i.e., no code change errors
discovered) were predominantly the ones smaller than 300 total statements in
length.
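
The two observations above can be checked against the major size bands of Table 4-31. The sketch below is illustrative only: the row values are transcribed from the scanned table, and the assumption that the quoted difference compares the smallest and largest size bands (relative to the larger rating) is a reading of the text, not something the report states.

```python
# Major size bands of Table 4-31, as read from the scanned table.
# Each tuple: (size band, TRW routines, routines with no problems,
#              average problems per 100 statements, average programmer rating)
TABLE_4_31 = [
    ("0-100",    51, 18, 4.09, 14.2),
    ("101-200",  25,  4, 1.81, 15.5),
    ("201-300",  12,  2, 2.22, 15.8),
    ("301-500",  23,  2, 1.85, 13.9),
    ("501-800",  29,  1, 1.52, 14.9),
    ("801-1500", 21,  0, 1.38, 16.1),
    ("1500+",    11,  0, 1.47, 16.4),
]

def rating_spread_percent(lower_rating, higher_rating):
    """Relative difference between two ratings, as a percentage of the higher."""
    return 100.0 * (higher_rating - lower_rating) / higher_rating

def share_of_clean_routines(rows, band_count):
    """Fraction of no-problem routines that fall in the first `band_count` bands."""
    clean_total = sum(row[2] for row in rows)
    clean_small = sum(row[2] for row in rows[:band_count])
    return clean_small / clean_total

# Assumed comparison: smallest size band versus largest size band.
spread = rating_spread_percent(TABLE_4_31[0][4], TABLE_4_31[-1][4])
# The first three bands cover routines of 300 statements or fewer.
small_share = share_of_clean_routines(TABLE_4_31, 3)

print(f"rating spread: {spread:.1f}%")
print(f"no-problem routines of 300 statements or fewer: {small_share:.1%}")
```

Under this reading, 24 of the 27 routines with no problems fall in the three smallest bands, which is consistent with the statement that problem-free routines were predominantly the small ones.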

To improve investigations of this type, more data collected in a controlled
manner are needed. These data should indicate how programmers are
using their time as well as where overlaps and duplication occur.

4.5.3  Problem  Report  Closure  Time 

The objective of this investigation was to determine the mean time
required to close a software problem report, that is, the time required
to identify the error, develop and test a fix, and close out the software
problem report. Project 3 data are used here because of availability and
because of the sense of urgency felt by project performers to find and fix
problems as quickly as possible. Additional information needed in this
investigation is as follows:

• Validation testing was conducted by software contractors.

• Acceptance testing was attended by customer, user, and
customer technical assistance personnel.

• Integration testing was conducted by a system integration
contractor not colocated with the software contractors
who fixed errors.


Figure 4-34. Routine Error Rates for Various Programmer Assignments


Table 4-31. Average Programmer Ratings According to Routine Size

NUMBER OF       NUMBER OF      NUMBER OF        AVERAGE PROBLEMS   AVERAGE
STATEMENTS      TRW ROUTINES   ROUTINES WITH    PER                PROGRAMMER
PER ROUTINE                    NO PROBLEMS      100 STATEMENTS     RATING

0-100                51             18               4.09            14.2
101-200              25              4               1.81            15.5
201-300              12              2               2.22            15.8
301-500              23              2               1.85            13.9
   301-400           11              1               1.84
   401-500           12              1               1.86
501-800              29              1               1.52            14.9
   501-600           13              1               1.49
   601-700            9              0               1.61
   701-800            7              0               1.45
801-1500             21              0               1.38            16.1
   801-1000           9              0               1.14
   1001-1500         12              0               1.54
1500 +               11              0               1.47            16.4
   1501-2000          6              0               1.53
   2001 +             5              0               1.39

• Operational  demonstration  was  conducted  by  S/W 
contractors  working  together  in  the  operational 
environment  with  user  assistance. 

• Priorities were assigned to problems according to
their impact on the progress of testing. Since test
cases were patterned after operational scenarios, a
high priority test problem would, in general, also be
a high priority problem in the operational environment.
Priorities were:

HIGH - Problem prevents execution of a test case or
seriously impedes demonstration of software
requirements.

MEDIUM - Problem impacts successful execution of a test
case but useful output is available and testing
can continue.

LOW - Problem does not impact demonstration of software
requirements and a workaround exists.

• Not all problem reports received a priority. These data
points were not considered, nor were problem reports
designated as product improvements. The sample size
used here is 1325 SPRs.
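
The selection and averaging rules listed above can be sketched as follows. This is a minimal illustration, not the procedure actually used in the study: the record layout, the field order, and every value in `SPRS` are hypothetical and stand in for the Project 3 problem report data.

```python
from collections import defaultdict

# Hypothetical SPR records: (test phase, priority, days from identification
# until the checked-out fix was available, is_product_improvement).
# A priority of None models a report that never received one.
SPRS = [
    ("validation", "HIGH", 4, False),
    ("validation", "HIGH", 6, False),
    ("validation", "LOW", 20, False),
    ("acceptance", "MEDIUM", 9, False),
    ("acceptance", None, 3, False),   # excluded: no priority assigned
    ("acceptance", "LOW", 7, True),   # excluded: product improvement
]

def mean_closure_days(records):
    """Mean closure time per (phase, priority), skipping excluded reports."""
    totals = defaultdict(lambda: [0, 0])  # key -> [sum of days, report count]
    for phase, priority, days, is_improvement in records:
        if priority is None or is_improvement:
            continue
        bucket = totals[(phase, priority)]
        bucket[0] += days
        bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}

means = mean_closure_days(SPRS)
for (phase, priority), days in sorted(means.items()):
    print(f"{phase:<12}{priority:<8}{days:.1f}")
```

Grouping by both test phase and priority mirrors the way the results are discussed in the paragraphs that follow.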

The mean times required to close problems followed trends that might
be expected; see Figure 4-36*. In each test phase high priority problems
were closed first. Note that the parameter here is not the Δt to fix the
problem but the Δt between the problem's identification and the time its
fix was checked out and available for use in formal testing. For high
priority problems the inactive time in the queue, when the problem was
open but not being attended to, was minimal because of the pressure to
close all high priority or test limiting problems. Inactive time for
medium and low priority problems was substantial by comparison, although
no data exist to quantify this Δt.

Validation and acceptance test cases and test objectives were basically
the same. The mean times to close high and medium priority problems
were virtually the same, too. The increased pressure to correct problems
during acceptance testing (due to customer involvement and impending
delivery dates) probably accounted for the drastic reduction in the time
required to close the generally easier-to-fix low priority problems.

Integration testing occurred some distance from the location of the
problem fixers. The pressure to close problems during integration was
similar to that during acceptance. There was a one day transit time
between the test group and the fixer of the problem, which would explain
the two day** difference between acceptance testing and integration testing
for high and low priority problems. The low Δt for medium priority
problems is unexplained.

The  operational  demonstration  concerned  itself  only  with  high  and 
medium  priority  problems,  leaving  low  priority  problems  for  later  closure; 
therefore  low  priority  problems  were  not  considered  here.  The  lower  mean 
on  high  and  medium  priority  problems  is  due  to  fewer  total  problems  to  cor- 
rect, lesser  severity  of  problems  encountered,  and  probably  to  some  extent 
the  experience  the  problem  fixers  had  in  correcting  problems. 


*Points are connected in this figure to amplify trends.

**One day for delivery of the problem description via SPR and one day for
delivery of the fix.

Project 5 data were not examined because factors influencing closure
time are not so easily identifiable and the pressure to effect closure
is not that of a production process, but that of an R&D program.

4.5.4  Design  Problems  and  Software  Problems 

The objective of this investigation was to determine if the number of
design problem reports (DPRs) generated during the design review process
of Project 3 might serve as an indicator of the number of software problem
reports (SPRs) generated in subsequent formal test phases.

In  this  investigation  there  were  really  no  preconceived  notions  con- 
cerning the  relationship,  just  a healthy  curiosity.  The  data  used  were 
raw  counts*  of  DPRs  written  against  the  same  software  that  was  subject  to 
SPRs  during  validation,  acceptance,  integration,  and  operational  demonstra- 
tion testing.  Also,  only  TRW  software  is  represented  here  due  to  lack  of 
co-contractor  DPR  data. 

Correlation at the function level** was fairly good, r = 0.969;
results are graphically presented in Figure 4-37. There was a definite
tendency for functions which were criticized during design reviews to also
be the subject of test SPRs. One interesting point on Figure 4-37 is the
outlier with relatively few DPRs. Further investigation revealed that this
function was one believed to be of relatively lower concern because it was
similar to software built for other projects, i.e., it wasn't considered
"new". This function, C1, was not an outlier in other test data
investigations, suggesting that it did not receive as thorough a design review as
other functions in the system.
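
For readers unfamiliar with the statistic, the correlation coefficient r quoted above is the usual Pearson product-moment coefficient. The following is a minimal sketch of how it is computed from paired counts; the `dprs` and `sprs` lists are hypothetical per-function counts chosen for illustration and are not the Project 3 data.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations about the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical counts: one entry per software function.
dprs = [120, 340, 560, 800, 950]   # design problem reports
sprs = [30, 75, 130, 170, 220]     # software problem reports

r = pearson_r(dprs, sprs)
print(f"r = {r:.3f}")
```

A single function with many SPRs but few DPRs, such as the outlier discussed above, pulls r down, which is why the report examines that point separately.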

It is believed that much can be gained in terms of developing improvements
in the design process by examination of problems documented during
design reviews. From the limited analysis of DPRs done in this study and
the results of the investigation of principal error sources, Section 4.2,
this would seem to be the case. Such examinations should include detailed
categorization of the type of error and the design activity that produced
the error. Error severity and the amount of review given each particular
segment of the design would be data collected to support studies of this
type.


*No attempt was made to categorize these according to the amount of review
given, type of problem they documented, or their severity. This would
have involved retrospective analysis of over 5000 DPRs, and time did
not permit this.

**At the routine level a linear fit was not as good.


Figure 4-37. Design Problems and Software Problems (Project 3)


4.5.5  Computer  Usage  vs  Software  Problem  Reports 

There  has  been  considerable  discussion  in  the  software  community 
concerning  the  possible  relationship  between  computer  usage  and  the  number 
of  resulting  software  problems,  with  particular  emphasis  on  developing 
mean  time  between  failure  (MTBF)  values  applicable  to  software. 

TRW  data  from  Project  3 validation  and  test  phase  provided  computer 
time  data  in  order  to  explore  this  possibility.  Since  there  was  reason 
to  suspect  that  the  distribution  of  time  versus  SPRs  might  be  dichotomous, 
computer  time  and  problem  report  data  were  divided  into  two  categories: 
formal  test  activity  and  developer  activity.  Test  activity  was  all  test- 
ing performed  by  the  independent  test  group  personnel  to  specified  test 
procedures.  Developer  activity  was  the  computer  time  used  for  the  veri- 
fication, solution  and  checkout  of  software  problems. 

In the scatter diagram, Figure 4-38, illustrating weekly computer
usage and problem reports for both activities, it is apparent, from the
pattern formation and the vertical separation of the means of the data
sets, that the data represent two populations. This can be explained by
examining the nature of the two activities using computer time. The test
function operates to a schedule, reporting problems encountered randomly
in the course of planned testing. Concurrently, the designers use computer
time on an as-needed basis, specifically for the verification of
problems, correction and retest, occasionally writing problem reports
when additional unrelated problems are encountered or when a problem is
not within the designers' jurisdiction. Thus, generally speaking, the
test activity finds problems, while the developer activity solves problems.
The random nature of the designers' activity is apparent in the scatter
diagram, clearly showing no measurable relationship between the variables.

Using a CDC 6500 applications library program, curves were fit to the
weekly test activity data (minus one out-lying point) to measure linearity.
The best approximation of the curve proved to be the hyperbolic function
of the form Y = A + (B/X), Figure 4-39, with the worst point estimation (actual
value) being -22.1% from the calculated value. The correlation coefficient
(r) was 0.87899 and the index of determination (r²) of 0.77262 implied a
77% dependence of the dependent variable (SPRs) on the amount of computer
time used.
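
A hyperbolic curve of this form can be fit with ordinary least squares, because Y = A + (B/X) is a straight line in the transformed variable U = 1/X. The sketch below shows that approach under stated assumptions: it is not the CDC 6500 library routine used in the study, and the `hours` and `sprs` values are hypothetical weekly observations invented for illustration.

```python
def fit_hyperbola(xs, ys):
    """Least-squares fit of y = a + b / x, done as a straight line in u = 1 / x.

    Returns (a, b, r_squared), where r_squared is the index of determination.
    """
    us = [1.0 / x for x in xs]
    n = len(us)
    mean_u = sum(us) / n
    mean_y = sum(ys) / n
    s_uu = sum((u - mean_u) ** 2 for u in us)
    s_uy = sum((u - mean_u) * (y - mean_y) for u, y in zip(us, ys))
    b = s_uy / s_uu
    a = mean_y - b * mean_u
    # Index of determination: share of the variation in y explained by the fit.
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_resid = sum((y - (a + b / x)) ** 2 for x, y in zip(xs, ys))
    return a, b, 1.0 - ss_resid / ss_total

# Hypothetical weekly data: computer hours used and SPRs written.
hours = [10.0, 20.0, 25.0, 40.0, 50.0]
sprs = [22.0, 41.0, 46.0, 50.0, 54.0]

a, b, r_squared = fit_hyperbola(hours, sprs)
print(f"Y = {a:.2f} + ({b:.2f} / X), r^2 = {r_squared:.4f}")

# The reported index of determination is the square of the reported r.
reported_r = 0.87899
print(round(reported_r ** 2, 5))
```

The last two lines only confirm that the two reported statistics are mutually consistent; they say nothing about whether the hyperbolic form would hold for other test phases or projects.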

The  daily  test  activity  was  then  plotted,  Figure  4-40,  to  determine 
whether  or  not  the  relationship  held  for  daily  activity.  It  is  clear  that 
it  does  not,  providing  no  possible  estimation  of  daily  SPR  activity. 

Extreme caution should be used in projecting these data to other test
phases or other software projects. While the statistics imply a strong
relationship of the variables on a weekly basis, the relationship was not
substantiated by the daily data. Further, there were very few data points
on which to generate the curve. Logic for the removal of the out-lying
point was based on a number of factors, such as inordinate testing of one
section of software, but the extent of actual influence of these factors
cannot be quantified.

In  addition,  the  generation  of  problem  reports  was  not  controlled. 
Writing,  or  not  writing,  an  SPR  was  a matter  of  judgment  rather  than  a 
defined  requirement.  Lack  of  control  tended  to  cause  reports  to  be  written 
which  did  not  reflect  true  problems.  Since  the  effect  cannot  be  measured, 
actual,  singular  problems  could  not  be  examined  against  the  actual  computer 
time  which  produced  them. 

Observations  which  can  be  noted  are: 

1) There may be a valid relationship between computer usage and
resulting problem reports when the data are viewed on a gross
basis. This may provide some management insight for planning
purposes.

2) The results suggest justification to further explore the relationship
and possibly to improve control of problem reports and
computer time on future programs.

3) No attempt should be made to establish MTBF values during this
test phase. Since the daily data do not corroborate the weekly
findings, the relationship is not sufficient to suggest an MTBF
could be valid.

4) Analysis for other projects or test phases should consider the
possibility of dichotomous data and thus thoroughly examine
the circumstances concerning computer usage and the related
problem reports.

4.5.6  SPR  Activity  for  Project  3 

Project  3 SPR  Activity  during  validation/acceptance,  integration, 
and  operational  demonstration  test  phases  shown  in  Figure  4-41  provides 
some  food  for  conjecture.  Shown  in  this  histogram  are  the  SPRs  per 
week  for  TRW  and  the  co-contractor. 

During  the  validation  and  acceptance  period  the  co-contractor  activity 
was  greatly  influenced  by  the  availability  of  the  TRW  software,  which 
performed  the  executive  and  major  initialization  functions.  Gradual  build- 
up of  the  co-contractor  problems  as  TRW  software  was  validated  is  evident 
from  the  histogram. 

Beginning with the 9th week of the validation/acceptance period two
factors influenced the SPR output of both contractors. First, the
co-contractor delivered a substantial portion of the software. Second, the
customer arrived with an additional force of analysts to prepare for
and participate in the forthcoming acceptance phase. Results of the
increased test activity and available manpower are apparent in the sudden
increase in SPR activity of both contractors. The "table" effect of the
TRW activity suggests saturation. That is, there may be a physical limitation
to the number of SPRs which can be written, assuming there are so
many problems that the simple act of looking for them results in SPRs. The
sudden increase in the average of TRW SPRs/week, along with the availability
of customer manpower, and a similar phenomenon for the co-contractor
activity may also be evidence of this.


If the above assumption is correct, it might be expected that once the
"residual"* errors dropped below the SPR saturation level, the number of SPRs
generated would exhibit a continual decline. The fact that this did not
occur on a per contractor basis implies that validation and acceptance
testing was stopped prematurely.

Build-up of activity during the integration phase is most likely due
to a learning curve on the part of the integrating test personnel, and the
fact that not all tests are independent of one another. Some tests provide
inputs to successive tests, resulting in an increasing number of tests which
can be run as time progresses. Tail-off, following the mid-point peak,
results from the discovery of errors that the test cases are designed to
find.

The operational demonstration test phase is short and merely shows
the number of problems discovered during the period by running tests that
simulated operational usage. Once these problems were closed, the product
was received by the customer for operational use.

*Residual for the specific test cases being examined.

4.5.7 Errors and Software Cost

The objective of this investigation was to determine the relationship
between the number of problems encountered during Project 3 preoperational
testing* and the cost of the software. That is, did the most error-prone
routines also cost the most to produce? The feeling was that the answer to
this would be yes since previous studies showed that cost correlated fairly
well with size, too.

Using Project 3 data, routine cost was determined for the entire development
period, including the design, coding, development test, and formal
test phases. It should be noted that other project costs such as top level
management, configuration management, and the independent test costs were
left out of this study since accurate records on how these organizations
spent their dollars at the routine level were not available. Figure 4-42
shows a normalized** plot of the cost as a function of the number of problems
documented for the routines in one TRW subsystem. For this subsystem there
was a definite positive correlation, r = 0.8424, between cost and the number
of problems. The cost vs. size relationship mentioned earlier is presented
in Figure 4-43 for the same Project 3 subsystem, where r = 0.8899. Only one
subsystem is shown here as an example of this type of analysis. Other
subsystems were similarly correlated.
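
Normalizing cost to the most expensive routine changes only the scale of the plot, not the strength of the linear relationship. The sketch below illustrates this property; the `costs` and `problems` values are hypothetical and are not taken from the Project 3 subsystem.

```python
import math

def normalize_to_peak(values):
    """Divide every value by the largest one, so the peak becomes 1.0."""
    peak = max(values)
    return [value / peak for value in values]

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical routine costs (dollars) and documented problem counts.
costs = [12000.0, 30000.0, 18000.0, 45000.0, 27000.0]
problems = [4, 11, 5, 19, 9]

normalized = normalize_to_peak(costs)
r_raw = correlation(costs, problems)
r_normalized = correlation(normalized, problems)
print(f"r (raw cost) = {r_raw:.4f}, r (normalized cost) = {r_normalized:.4f}")
```

Because r is unaffected by a positive rescaling of either variable, a normalized plot allows cost trends to be shown without the dollar figures and without altering the reported correlation.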


*Validation, acceptance, system integration, and operational demonstration
testing.

**Cost normalization was done with respect to the most expensive routine in
the subsystem.

4.6 Comparison of Data from Different Projects


The temptation to compare data from different projects is great. Yet
the results of comparison can be misleading unless the differences and
similarities between the projects are well specified in common terms.
Differences between data items and the means of their collection are also
needed.

Our studies have emphasized just how significant the problem of providing
a basis for comparison can be. For example, the work with Project 3
software attributes and metrics, Section 4.3, was thought to be a basis for
comparison of structural characteristics collected from Project 5. Using
the logical complexity* metric appropriate for describing complexity on
Project 3, we very quickly found Project 5 to be structurally fairly simple.
Project 3 nesting of loops and IFs reached indenture levels frequently
exceeding the fifth level and sometimes reaching as high as the 10th level.
Such nesting was awarded a high complexity rating. Project 5, on the other
hand, cannot afford this kind of complexity since run time must be minimized.
By comparison, it must be structurally more simple than Project 3.
However, if one were to ask people familiar with both projects to rate the
complexity of each, Project 5 would be rated more complex and descriptions
like "orders of magnitude more complex" would be used. As might be
imagined, the difference is in something other than the code. Project 5
is solving a highly complex analytical problem in real time, a problem
being solved with algorithms developed for the first time. Project 3,
although highly analytical, solves a problem that is, by comparison, well
understood.

Even comparison of such apparently straightforward attributes as
size can be misleading. One of the most common measures of size is in
machine language instructions. This is typically the easiest measure of
size to collect. However, size at the machine language level is dependent
on the machine. The conversion factor from source language statements to
machine language instructions can vary considerably. What we have found
is that both measures of size are necessary, and the source measure should
recognize the difference between executable and nonexecutable code, as well
as comments. Even then the size picture is not complete for comparison
purposes. A definition of the product whose size is being measured is
important. In the case of Project 3 the size of the end product source
code was very nearly** all the code developed. For Project 5, however,
the end product's size is only a fraction of the code developed, partly
because of the top down approach in which a certain amount of "breakage"
or replacement of code is planned and partly because of the special tools
needed to develop real time code. In an accurate comparison of error rates
in errors per unit size, or especially in a comparison of programmer
productivity, these things must be considered.


*A measure of complexity based on source code structure.

**With the exception of nondeliverable debug code.

The problem, then, is that we're not able to define projects in common
terms, and it is this problem which keeps us from making many comparisons
in this study. However, based on our studies there is reason to
believe that this is not an insolvable problem. Similarities in error
data from more than one project, our only real area for comparison, are
encouraging. The fact that the error category lists were essentially the
same helped. When uniform approaches to collecting and analyzing data
are developed and employed in controlled experiments, further comparison
will be possible.


4.7  Tools  and  Techniques 

In a general sense, many things can be done to improve the software
development and test processes. This is becoming increasingly obvious
from recent papers in the trade journals and the objectives of current
government R&D programs. Improvements are taking one of two generic forms,
either tools or techniques. Their applications are aimed at both managerial
and technological improvements during virtually any phase of the software
life cycle, from pre-proposal activities, through requirements analysis,
design, coding, testing, and into operation and maintenance. Tools are
defined for this discussion as computer programs which perform
tasks which would otherwise have to be done manually. Techniques are
defined as the standards and procedures used in the development and
maintenance of the software system. The principal objectives of these tools and
techniques may be summarized by the following:

• to  eliminate  errors  before  they  get  into  the  product 
(code  or  documentation) 

• to  find  those  errors  that  do  get  into  the  product  earlier 

• to  support  project  performers  in  completion  of  drudgery  tasks, 
freeing  them  to  work  on  technical  problems 

• to increase communication and enhance common understanding
between project performers, the customer, and the user

Since almost any topic dealing with the development of software systems
may serve as the target of a brainstorming session on tools and
techniques, the full spectrum is understandably (and in this discussion,
prohibitively) large [10,11]*. Suggestions given in the following paragraphs
are limited, therefore, to those that appear to provide obvious benefit
in the Project 3 environment.

*Reifer in References 10 and 11 provides a very comprehensive summary of
tools and techniques (he appropriately calls these aids) that are presently
in use, being developed, or under investigation. Viewgraphs shown by
Reifer at the 1975 International Conference on Reliable Software present
additional information on the subject.

Also, a certain economy in selection of tools and/or techniques is
suggested by available empirical data and by References [12] and [13].
For example, units conversion errors could be detected with a tool called
a Units Consistency Analyzer, but may also be detectable through use of
algorithm string tests and design/code inspections or walk-throughs. During
Project 3 testing 9.3 percent of the computational errors (0.8 percent
of all errors) could have been detected by a units consistency analyzer.
Other related computational errors which might have been detected by such
a tool boost the percentage to 15.9 percent of the computational errors
(1.4 percent of all errors). However, virtually every computational error,
including those errors mentioned above (9.0 percent of all errors), would
be susceptible to detection by multipurpose design and code inspections and
algorithm string tests. Therefore, the analysis presented here is limited
to that made possible by Project 3 data, i.e., technological improvements
in the form of tools and techniques applicable to the development and test
phases of the software life cycle. The approach taken in this analysis is
similar to the one suggested by Curry in [12].
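
The paired percentages quoted for the units consistency analyzer are related by simple arithmetic: a percentage of the computational errors is converted to a percentage of all errors by multiplying by the computational category's share (9.0 percent). The short sketch below, with constant and function names chosen only for illustration, reproduces the quoted figures.

```python
# Percentages quoted in the text for Project 3 errors.
COMPUTATIONAL_SHARE_OF_ALL = 9.0        # computational errors, % of all errors
UNITS_SHARE_OF_COMPUTATIONAL = 9.3      # units conversion, % of computational
RELATED_SHARE_OF_COMPUTATIONAL = 15.9   # with related errors, % of computational

def share_of_all_errors(percent_of_category, category_percent_of_all):
    """Convert a within-category percentage to a percentage of all errors."""
    return percent_of_category * category_percent_of_all / 100.0

units_only = share_of_all_errors(UNITS_SHARE_OF_COMPUTATIONAL,
                                 COMPUTATIONAL_SHARE_OF_ALL)
with_related = share_of_all_errors(RELATED_SHARE_OF_COMPUTATIONAL,
                                   COMPUTATIONAL_SHARE_OF_ALL)

print(f"units conversion errors: {units_only:.1f}% of all errors")
print(f"including related errors: {with_related:.1f}% of all errors")
```

The comparison that drives the economy argument is between roughly 1.4 percent of all errors for the specialized tool and the full 9.0 percent reachable by inspections and algorithm string tests.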

4.7.1  Techniques 

Techniques will be discussed first because quantitative results showed
them to have the greatest potential for improvement. Data used in this
analysis were the 4519 software problem reports (SPRs) written between
delivery of the code to an independent test team and the completion of an
operational demonstration just prior to turnover to the customer. As was
noted in Section 4.2, these were categorized according to error type into
20 major categories and 165 detailed categories within the various major
categories. The sample size was reduced significantly when all SPRs that
did not produce a code change were removed from consideration*. There
remained 2019 SPRs which produced changes to the code, and these are the
data that were used in the analysis here.

Next, an attempt was made to determine which error categories were
susceptible to several preventive and detective techniques. It should be
noted that these techniques won't guarantee that the errors will be caught,
but they will tend to make errors visible, i.e., certain types of errors
will be susceptible to prevention or detection with these techniques.

Finally, the terms "preventive" and "detective" are used loosely here.
Any technique (or tool) which is applicable prior to any form of testing is
termed preventive. Those connected with testing are termed detective.

Table 4-32 presents a summary of general error categories and the percentages
of each category susceptible to the various techniques being
considered; Table 4-33 presents the percentages of all errors, by major
category, susceptible to the same techniques. Note that these techniques are
not exclusive in their ability to locate errors, e.g., design standards and
design inspections both have the capability to prevent the same type errors.

*SPRs that didn't produce a code change were due to out-of-scope (product
improvement), no-problem, pure documentation, and data base change SPRs.

Table 4-32. Susceptibility of Project 3 Errors to Preventive and Detective Techniques

4.7.1.1 Design Standards

Design standards are particularly appealing in the Project 3 environment
because so much is known about the problem to be solved, the requirements,
and user needs. Using the approach outlined above, as high as
28.7 percent of the code change errors found in subsystem and system type
testing were judged susceptible to prevention through some sort of design
standard. In the computational category these errors fell into the
following detailed categories:

• computation of physical and logical entries

• computations related to indices and indexing schemes

• calculating and converting time

• units conversion

Logical categories that might be prevented through design standards
were

• handling endless loops

• logic necessary to check for data conditions and flag
settings

To varying degrees similar design standards are possible for prevention
of I/O, data handling, and interface errors. Detailed error categories
for these are as follows:

-I/O 

• tape  and  data  card  output  formats 

• debug  output 

• error message content

-Data  Handling 

• setting/using  internal  variables 

• data  chaining 

• complete  processing  of  input  data 

-Interfaces 

• calling  sequences 

• communication  through  correct  data  block 

4.7.1.2 Coding Standards

An expansion of existing project coding standards was judged to have
as high as 26.3 percent preventive effect on the Project 3 code change
errors. Here the greatest benefit* would be in the data handling, logical,
I/O, and interface areas. As might be expected with a Compool and existing
coding standards, errors in the global and local variable definition
categories were low, less than 0.1 percent of all errors. Susceptible
detailed categories are as follows:

-Data Handling

• initialization and updating of flags and indices

• bit manipulation

• floating point to integer conversion

• definition of local (non-Compool) variables

• data packing/unpacking

-Logical

• limit determination

• loops

• tests of indices and flags

-I/O

• error message formats

• output field size

• output header placement

• line count/page ejects

-Interfaces

• routine calling sequences

*Note that standards for structured programming and structured design are
not addressed here nor under design standards, respectively. Although
such standards would definitely enhance understandability, testability,
maintainability, etc., quantitative benefits were undeterminable. Use of
such standards is believed to be part of the Project 5 success.

4.7.1.3 Design and Code Inspections

The design and code inspections described here are basically those
suggested by Fagan in [13]. In this report inspections are described as
fairly formal "walk-throughs" of, first, the design and later the code.
These inspections consist of a presentation of the design or code by the
person(s) responsible for each, respectively, to a group of other project
performers represented by those knowledgeable in design, coding, testing,
interfacing software, and the data base. Also, there is an inspection
chairman who keeps the meeting going according to the agenda, which is made
up of standard topics and topics made obvious by errors that are found to
be most common. These inspections are truly working meetings between
project performers who are creating the product and should not be confused
with the larger and generally higher level preliminary and critical design
reviews (PDRs and CDRs) defined in MIL-STD-1521 [14]. These should still be
held, even with the design and code inspections.

Inspection techniques provide a significant potential for error
prevention, 38 percent in the IBM experience [13] and an estimated
62.7 percent of all errors using the Project 3 data. This is because
inspections force a type of communication at the worker level which occurs
randomly and in varying degrees of formality on most projects. Similar
inspections are being conducted with apparent success in selected groups
on Project 5, where it has been found that the essential ingredients needed
for these inspections are a) getting all the right people together at the
same time, b) defining meaningful agenda items tailored to the specific
project, and possibly most important c) allocating time in the schedule to
accomplish inspections.


4.7.1.4 ... and Algorithm Testing


These  detective  techniques  are  suggested  for  use  during  development 
testing,  i.e.,  the  testing  done  by  the  programmer  subsequent  to  achieving 
an  error-free  compilation.  Although  not  new,  these  techniques  can  provide 
the  greatest  single  error  removal  lever  in  the  development  cycle  because 
tests  are  performed  on  units  of  code,  usually  the  routine,  sufficiently 
small  so  that  testing  can  be  very  thorough.  In  fact,  on  Project  5,  with 
its  design  and  coding  standards  for  maximum  routine  size  of  100  executable 
statements, it is possible to execute all paths during unit testing+. Also,
unit  tests  are  performed  by  the  developers,  who  know  best  how  the  unit  works 
and  where  the  potential  errors  are.  Using  the  Project  3 data,  it  was 
determined  that  72.9  percent  of  all  errors  found  were  susceptible  to  a com- 
bination of  path,  functional  capability,  and  data  extremes  testing.  (This 
assumes  enough  time  to  accomplish  these  tasks  in  detail.) 

Table 4-33 presents the percentage breakdown of this functionally
oriented testing into the 20 Project 3 major categories and shows that
72.9 percent of the errors could have been detected with this combined test
strategy. It also shows, in a column headed DSET, that 51.1 percent of the
errors would require consideration of data singularities and extremes in the
testing. These results are particularly interesting in light of discussions
of the mathematical theory of software reliability, Sections 5.0 and 6.0.

*A productivity increase of 24 percent is also reported.
^Functional Capability List
**Data Singularity and Extremes Testing
+Where a path through a loop is defined so that it includes at least one
traversal of a loop.

A functional capability list (FCL) is a detailed list of things a
portion of the code must do. These lists may be compiled for virtually any
level of modularization, but the level suggested here is the unit or routine
level. Ideally, the functional capabilities would be a further breakdown
of and traceable to software requirements. Items on the list become the
things that testing of the unit must demonstrate, and for each item test
success criteria and expected test results may be specified. In conjunction
with the testing of specific functional capabilities, it also is possible
to accomplish the following generic test objectives.

Logical  Objectives 

• Execute every coded statement in the software at least
once, and execute every source code branch at least
once

• Utilize  every  entry  point  at  least  once 

• Utilize  every  exit  point  at  least  once 

• Exercise  and  validate  every  error  message  at  least  once 

• Verify that all decision points within every subroutine
are executed properly

Computational  Objectives 

• Verify every computation using legal input values

• Verify  proper  handling  of  extreme  values  (maximum  and 
minimum),  singular  values,  and  out-of-bounds  values 
for  every  computation 

• Verify  proper  handling  of  missing  input  data  associated 
with  every  computation 

(The  verification  method,  e.g.,  hand  calculation  versus  use  of  results  from 
similar  but  independent  programs,  may  vary  with  the  specific  application.) 

Data  Handling  and  I/O  Objectives 

• Verify  that  input  data  are  obtained  from  the  proper 
location 

• Verify that output data are stored in the proper
location and format

• Verify  that  every  data  conversion  is  correctly  performed 

• Verify  that  incorrect  data  are  properly  handled 

• Verify  that  data  are  not  lost  nor  destroyed 
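
The objectives above pair naturally with a functional capability list kept
for each routine. As a minimal sketch, assuming one simple record per
capability, such a list might be held as shown below. The class name, field
names, requirement identifiers, and sample entries are all hypothetical and
are not drawn from any of the projects studied.

```python
from dataclasses import dataclass, field


@dataclass
class Capability:
    """One item on a functional capability list (hypothetical layout)."""
    description: str        # what the routine must do
    requirement_id: str     # software requirement the item traces to
    success_criterion: str  # how a test of this item is judged
    tests_run: list = field(default_factory=list)  # tests that exercised it


def untested(fcl):
    """Return the capabilities that no test has demonstrated yet."""
    return [c for c in fcl if not c.tests_run]


# Hypothetical list for a units-conversion routine.
fcl = [
    Capability("convert feet to meters", "REQ-12",
               "result agrees with hand calculation"),
    Capability("reject negative distance", "REQ-13",
               "error message issued"),
]
fcl[0].tests_run.append("T001")

remaining = [c.description for c in untested(fcl)]
print(remaining)
```

Keeping the requirement identifier on each item is what makes the list
traceable in both directions, from requirement to test and back.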

Note the repeated reference to test data in this strategy. The technique
of testing with data singularities and extremes, as well as typical values
representative of actual operational values, has been termed DSET here.
An obvious benefit of this type of testing is the early attention
focused on an operational data base for use in validation or operational
type testing at the subsystem or system level. Project 5 application of
this technique has resulted in early identification of operational data
base problems.

Algorithm string testing is a form of simulation involving duplication
of algorithms outside the modular structure of the code in order to test
data flow, sensitivity to input data, timing and accuracy, and to facilitate
examination of algorithms in great detail.
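
The DSET idea described above can be illustrated with a small value
generator. This is a sketch only, assuming an integer-valued input with
known legal limits; the function name and its parameters are illustrative
assumptions, and a step of one is used to form the out-of-bounds values.

```python
def dset_values(minimum, maximum, singular=(), typical=()):
    """Build a test value list: extremes, out-of-bounds, singular, typical."""
    values = [minimum, maximum]            # extreme (legal) values
    values += [minimum - 1, maximum + 1]   # out-of-bounds values
    values += [s for s in singular if minimum <= s <= maximum]
    values += list(typical)
    # Drop duplicates and return the values in ascending order.
    return sorted(set(values))


# Hypothetical input: an index legal from 0 to 100, where 0 is also a
# singular value (for example, a divisor elsewhere in the routine).
cases = dset_values(0, 100, singular=[0], typical=[37])
print(cases)
```

Each generated value would then be paired with an expected result taken
from the functional capability list for the routine under test.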

4.7.1.5 Integration Testing

Integration and integration testing have always been difficult to
accomplish due to schedule pressures and the need to begin validation
testing. Examining the Project 3 data again, an estimated 46.1 percent of
the errors found in validation testing may have been visible through testing
directed at routine/routine, routine/data base, and subsystem level
interfaces. Regardless of the magnitude of the improvement percentagewise,
the symptoms of interface errors are particularly troublesome to validation
testing because of their severity (e.g., aborts, no-loads, etc.). A marked
improvement would be to schedule (even at the expense of validation test
time) a dedicated integration period with test cases specifically designed
to detect interface errors. The test objective would be to make sure that
all elements of the code load and cycle together and that the software
passes data correctly from start to finish. The implication that
integration testing plays down investigation of all requirements except
interface requirements is intended. Validation test procedures designed to
demonstrate functional and performance requirements are typically poorly
suited for efficient interface testing.

4.7.1.6 Validation, System Integration, and Operational Demonstration

The importance of subsystem and system level testing is obvious from
what can be seen in the error data. Testing to requirements (validation
and acceptance testing), assuring that the software performs correctly when
played together with the total system (system integration testing), and
operational exercises have their own specific test objectives, yet each
uncovers errors that should have been discovered earlier. This appears to
be the rule in real-world situations, regardless of the amount of planning
and actual testing that may have gone on before. Using Project 3 as a
sample, only about 27 percent of the total number of code change errors that
were found during the subsystem and system level testing phases named above
were found when they should have been found.

It is apparently the plight of the software developer to be behind
schedule. Typically he has a lot of things going against him, things over
which he has little or no control, e.g., resource availability, changing
external requirements, and unforeseen technical problems. The result is
that the schedule is compressed in one fashion or other, and the
thoroughness of testing is usually the thing that suffers. Testing is the
last item in the schedule. Although both thorough routine level development
testing, and function and subsystem level interface testing were planned
for Project 3, the schedule remained inflexible while the technical problems
continued to occur. Forcing subsystem and system level validation testing
by an independent test group* on a scheduled date actually had a positive
effect on the progress and quality of the software product. It tended to
force interface testing to occur, and the additional personnel increased
the number of software problem reports written per unit of time (See
Section 4.5.6). The penalty that had to be paid was the inefficiency of
discovering errors which should have been found in earlier test phases
through execution of higher level tests designed to demonstrate satisfaction
of requirements in the operational environment. On the positive side, this
higher level testing was the user's final filter before accepting delivery
of the product. It allowed him to become involved in operational-like
testing which also served as a training session for operational crews. And,
the error trends seen in preoperational testing are predictors in terms of
numbers of operational errors*. Also, the types of errors encountered are
much like those experienced during operational usage. Involvement in the
error identification and removal process of preoperational system testing
was also an important facet of user training.

*This group had liberal help from the software development group in the
areas of error correction and test output review.

4.7.2  Tools 

Tools may be categorized according to their roles in eliminating
specific software errors as being preventive (e.g., an algorithmic simulator)
or detective (e.g., an interface checker). A third category exists for
those tools which perform a support role (e.g., a dynamic path analyzer)
where the tool supports some development or test technique. It will be
seen that the benefit of tools is fairly easy to quantify for the first two
categories and less so for the support tools, although it is this third
category which may represent the greatest lever in creating an error-free
software product because of the broad range of errors made visible. For
example, an algorithmic simulator would be effective in preventing specific
computational errors, maybe as high as 74 percent of the Project 3
computational errors (6.6 percent of all errors). A dynamic path analyzer, a
support tool, is an obvious benefit to anyone interested in thorough
testing, but this benefit is much more difficult to quantify. The dynamic
path analyzer in this example would typically be used to support functional
capability testing at the routine level where the scope of the error
detection activities is not limited to one detailed or even general category
of error, in this case 72.9 percent of all errors. Without explicit (and
controlled) experimental evidence the best that can be said is that these
support tools greatly assist in performance of techniques with a potential
for dealing with some types of errors.

In the discussion which follows, tools representing a benefit in the
Project 3 development and test environment are described. A summary is
provided in Table 4-34 for selected preventive and detective tools, followed
by short descriptions of these tools and some suggested support tools.

4.7.2.1 Simulations

Simulations designed to investigate logical and computational aspects
of the software system would have addressed 9.5 percent and 5.8 percent of
the Project 3 errors, respectively. A simulation of system data flow, with
consideration of the internal and user interfaces, would have helped with
another 3.9 percent of all errors. Computational simulations might also
have helped in early determination of preset data base values, accounting
for 0.8 percent of all errors. Here it should be pointed out that the
percentage for Project 3 preset data base errors may be very low compared
to other systems, especially real time processors and systems with large
hardware/software interfaces. In these systems, like Project 5, simulations
can be instrumental in early data base tuning.

*Correlation coefficient of 0.92 observed for Project 3 (See Section 4.2.4).

4.7.2.2 Design Languages

This generic group of tools represents the largest benefit of any of
the tools listed in Table 4-34. This might be expected from the finding
that approximately 64 percent of the Project 3 errors were design errors.
The same percentage existed for Project 2. The chief benefit of design
languages is that they specifically address program logic, including the
frequent logic error, missing logic or condition tests, which alone
accounted for 12.3 percent of all code change errors (47.6 percent of the
logic errors). An estimated 31.7 percent of all errors were susceptible to
prevention through use of design language tools.
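
The two percentages quoted for missing logic or condition tests imply the
overall share of logic errors. The short calculation below is a worked
check only; the variable names are illustrative, and the derived figure is
an inference from the two quoted numbers rather than a value stated in the
data.

```python
# Figures quoted in the text for "missing logic or condition tests".
share_of_all_errors = 12.3    # percent of all code change errors
share_of_logic_errors = 47.6  # percent of the logic errors

# If 12.3 percent of all errors is 47.6 percent of the logic errors,
# the logic errors as a whole are this percentage of all errors.
logic_share_of_all = 100.0 * share_of_all_errors / share_of_logic_errors
print(round(logic_share_of_all, 1))
```

On these figures roughly a quarter of all code change errors were logic
errors, and nearly half of those were omissions of the kind a design
language is meant to expose.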

4.7.2.3 Code Standards Auditor

The chief benefit of this detective tool is that it enforces adherence
to coding standards. In the Project 5 application this Product Assurance
tool has seen eventual acceptance by all development groups, to the point
where they use it themselves to verify compliance of code with project
standards*. The result is code that is easy to read, is well documented, and
looks as if one person wrote it all. In addition to the improvements in
understandability and code documentation, specific coding errors which can
be precluded through adherence to coding standards could be eliminated as
part of the pre-compilation process. An estimated 26.3 percent of the
Project 3 errors were susceptible to prevention with project specific coding
standards.

4.7.2.4 Units Consistency Analyzer

This tool checks consistency (and compatibility) of stated parameter
units within equations and prespecified standard units. Such a tool could
also be expanded to check global variable definitions against accepted
project standards. Although errors detectable by such a tool represent a
small portion of the Project 3 data (2.0 percent), this tool is particularly
attractive because it could virtually eliminate such errors while at the
same time forcing developers to think in great detail about data definitions.
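
A units consistency check of this kind can be sketched in a few lines. The
example below is an illustration under stated assumptions, not a
description of any tool used on the projects: units are represented as
exponents of base units, and the variable names, unit names, and function
names are hypothetical.

```python
def multiply(u, v):
    """Units of a product: add the exponents of each base unit."""
    out = dict(u)
    for base, exp in v.items():
        out[base] = out.get(base, 0) + exp
    return {b: e for b, e in out.items() if e != 0}


def divide(u, v):
    """Units of a quotient: subtract the exponents of the divisor."""
    return multiply(u, {b: -e for b, e in v.items()})


def check_assignment(target_units, expression_units):
    """Return None if the units agree, otherwise a diagnostic string."""
    if target_units == expression_units:
        return None
    return f"unit mismatch: {target_units} vs {expression_units}"


# Hypothetical declarations of stated parameter units.
units = {
    "RANGE": {"ft": 1},
    "TIME": {"sec": 1},
    "SPEED": {"ft": 1, "sec": -1},
    "SPEED_M": {"m": 1, "sec": -1},
}

quotient = divide(units["RANGE"], units["TIME"])
ok = check_assignment(units["SPEED"], quotient)     # SPEED = RANGE / TIME
bad = check_assignment(units["SPEED_M"], quotient)  # feet stored as meters
print(ok, bad)
```

A check of this form catches the second assignment, where a value in feet
per second would be stored in a variable declared in meters per second.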

4.7.2.5 Set/Use Checker

This tool's primary function is to identify where certain data items
are set and which routines use them. It also provides information on which
routines call or are called by other routines. It also locates and
identifies variables, both global and local, that are declared but not
set/used and variables that are used but not declared. This tool (a
similar one was available for Projects 2 and 3) is extremely useful in
detecting data handling errors.

*Compliance with Project 5 standards can be considered very good, in excess
of %%. Those not complying must obtain a waiver based on technical merit.
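
The set/use check described in Section 4.7.2.5 can be approximated for a
modern language with a syntax tree walk. The sketch below is an analogy
only: Python has no declarations, so assignment stands in for "set", and
the function name and sample source are hypothetical.

```python
import ast


def set_use_report(source):
    """Report names that are set but never used, and used but never set."""
    names_set, names_used = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                names_set.add(node.id)
            elif isinstance(node.ctx, ast.Load):
                names_used.add(node.id)
    return {
        "set_not_used": sorted(names_set - names_used),
        # Built-in names would also appear here; a fuller tool would
        # filter them out.
        "used_not_set": sorted(names_used - names_set),
    }


sample = """
count = 0
limit = 10
total = count + step
"""
report = set_use_report(sample)
print(report)
```

In the sample, `limit` and `total` are set but never used, and `step` is
used without ever being set, which is the kind of data handling error the
tool is meant to surface.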

4.7.2.6 Compatibility Checker

Compatibility checker is a general term for a tool which checks for
compatibility between two elements of the code, e.g., calling sequence
definitions. As might be expected, most of these errors were in the
interface categories; an estimated 10.7 percent of all errors could be
checked for compatibility with other parts of the code.
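
One narrow form of compatibility check, comparing each call against the
calling sequence in the routine's definition, can be sketched as follows.
This is a simplified illustration: it counts positional arguments only and
ignores defaults and keyword arguments, and the function name and sample
source are hypothetical.

```python
import ast


def calling_sequence_mismatches(source):
    """List calls whose positional argument count differs from the definition."""
    tree = ast.parse(source)
    expected = {
        node.name: len(node.args.args)
        for node in ast.walk(tree)
        if isinstance(node, ast.FunctionDef)
    }
    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            name = node.func.id
            if name in expected and len(node.args) != expected[name]:
                # (routine, arguments expected, arguments passed)
                problems.append((name, expected[name], len(node.args)))
    return problems


sample = """
def update(track, time):
    return track + time

a = update(1, 2)
b = update(1)
"""
mismatches = calling_sequence_mismatches(sample)
print(mismatches)
```

The same comparison could be extended to argument types or to data block
layouts, the other interface elements named in the error categories.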

4.7.2.7 Dynamic Path Analyzer

A dynamic path analyzer identifies and instruments segments and paths
in the code and monitors execution of the code to determine which portions
of the code were exercised. This support tool is particularly valuable in

• determining a measure of test thoroughness,

• minimizing test redundancy,

• relating specific routine capabilities and allocated software
requirements to segments and paths in the code,

• determining the extent of retest after a problem is fixed.

It was noted earlier that such a tool should be applied during detailed
development testing at the routine level, the test phase that has the
greatest potential for discovering software errors. It can serve a
valuable function during integration and system level testing in the
determination of test thoroughness. It is also particularly useful in
confidence testing subsequent to update of the software. In confidence
testing the objective is to test new code as well as demonstrate the fact
that old capabilities have not been altered.
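
The monitoring half of a dynamic path analyzer can be imitated with the
tracing hook in the Python standard library. The sketch below records which
lines of one routine execute for a given input; it is an illustration only,
it measures statement coverage rather than full path coverage, and the
function names are hypothetical.

```python
import sys


def lines_executed(func, *args):
    """Run func(*args) and return the set of its line offsets that executed."""
    code = func.__code__
    hit = set()

    def tracer(frame, event, arg):
        # Record only "line" events that belong to the routine under test.
        if frame.f_code is code and event == "line":
            hit.add(frame.f_lineno - code.co_firstlineno)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(previous)
    return hit


def classify(x):             # offset 0
    if x < 0:                # offset 1
        return "negative"    # offset 2
    return "non-negative"    # offset 3


first = lines_executed(classify, 5)
both = first | lines_executed(classify, -5)
print(sorted(first), sorted(both))
```

Comparing the set of lines hit against the full set of lines in the routine
gives a simple measure of test thoroughness, and shows which additional
inputs are needed to reach the unexercised segments.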

4.7.2.8 Test Data Generator

A thorough development test strategy involving use of a dynamic path
analyzer would require that all segments (in-line code between branch points)
and all branches be exercised. For some short, non-complex routines it is
even possible to exercise all paths, a Project 5 routine-level test
objective. One of the penalties of such a strategy, however, is that the
developer has to carefully examine his code to determine data conditions
that force execution of specific segments or paths. A tool to automate this
data generation process could greatly reduce the amount of work required to
set up and execute tests.

4.7.2.9 Test Case Execution Monitor

For validation and system level testing, maintenance of test
configurations and creation of test records represents a considerable amount
of work when done manually. On Project 3 there were in excess of 200 test
cases during formal subsystem and system level testing. Each was executed a
number of times* and each required that complete records be kept concerning
characteristics of the test environment and the success or failure of the
test.

A test case execution monitor would load and execute test decks from
tape or disk and could be capable of recording

• test case and software configurations

• the number of test executions and whether they were
successful or not

• status of applicable SPRs

• satisfaction of software requirements

Presumably, this monitor might also be capable of inserting debug code at
instrumented points in the source program, compiling, and executing with
prespecified data. Such a tool is being used effectively by Project 5 test
personnel.

4.7.2.10  Generalized  Data  Base  Construction  Tool 


Software testers have always been plagued by the need to have test
data early in the development cycle. Responses to requests for GFE data
are generally late**, and the development groups are reduced to creating
data bases by hand, a task of purest drudgery which could be greatly reduced
by a tool capable of generating tables and arrays of data*** for development
test purposes. This support tool should have the capabilities of generating
values between extremes, creating singularities such as discontinuities,
and creating combinations of settings. It should also have the capability
of extracting and reformatting data from blocks on other existing data
bases. Creation of large numbers of entries for simulating maximum data
load conditions, as well as tape and punched card output, should be possible
with this tool. The benefit of such a tool would be a freeing of the
development personnel to spend more time debugging and testing the code
prior to initiation of higher level subsystem and system testing.

4.7.2.11  Data  Base  Comparator 


One tool available to the Project 3 test group which proved extremely
useful as a test support tool was a data base comparator. This tool
compared data bases, input as well as output, element for element and
flagged differences in values. It was extremely useful in demonstrating
that changes to the code (e.g., use of debug code for test purposes) did
not alter the software's performance.

*Actual execution records were part of the "perishable" data that were
destroyed after project completion. A conservative estimate is 3
executions per test case.

**Government suppliers of these data bases also lack tools.

***Operational data bases would still be required GFE for system type
testing.
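
An element-for-element comparison of the kind performed by the data base
comparator can be sketched briefly. The example assumes each data base is
available as a mapping from element name to value; the function name, the
element names, and the "<missing>" marker are hypothetical.

```python
def compare_data_bases(before, after):
    """Compare two data bases element for element and flag differences."""
    differences = []
    for key in sorted(set(before) | set(after)):
        old = before.get(key, "<missing>")
        new = after.get(key, "<missing>")
        if old != new:
            differences.append((key, old, new))
    return differences


# Hypothetical output data bases from runs with and without debug code.
baseline = {"ALT": 1000, "SPD": 250, "HDG": 90}
with_debug = {"ALT": 1000, "SPD": 255, "FLAG": 1}

diffs = compare_data_bases(baseline, with_debug)
print(diffs)
```

An empty result is the evidence sought in confidence testing: the change
to the code did not alter the values the software produces.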

4.7.3 Reflections on the Project 3 Findings

The obvious message from the preceding analysis of Project 3 data is
that more testing should have been done prior to initiation of formal
subsystem and system level testing. An estimated 72.9 percent of the errors
resulting in a code change should have been found through thorough routine
level, tools-aided testing.

The  data  also  inspire  a number  of  questions.  For  instance,  was  detailed 
routine  level  testing  planned,  and  if  so,  was  it  accomplished?  Routine 
level  testing  was  planned,  although  not  in  the  detail  suggested  in  the 
preceding  paragraphs.  Function  level  integration  testing  was  also  planned. 
However,  neither  type  of  testing  was  completely  accomplished,  nor  could 
they  have  been  in  the  seven  month  schedule  prior  to  delivery  of  the  software 
to an independent validation test team. In the real world of tight
schedules, the activity at the end of the cycle suffers when time is limited,
and there simply wasn't enough time to complete routine level testing.

One might also ask if the inability to complete routine level testing
was foreseen, and what was done about it. The situation was recognized
fairly early in the development cycle. Evidence of this was in the
developers' weekly status reports to management and also in the data
available through recordkeeping activities of the configuration management
organization. Although a number of things were done to help the situation,
the fact that formal subsystem and system tests were not strictly limited to
demonstration of requirements probably did more to assure an acceptable
product at delivery time than any other single thing. It was this objective
of testing as many capabilities as possible, along with stated requirements,
that made it possible to detect so many non-requirements related errors.

The next question might be how error free was the software in
operational use? Using a simplistic ratio of the number of code change
errors to software size, the operational software experienced 1.7 errors per
thousand total source statements* (0.61 errors per thousand machine language
instructions) for intermittent operation over a period of approximately one
year. A rate of 17.5 errors per thousand total source statements (6.3 errors
per thousand machine language instructions) was experienced during
preoperational testing. The operational software was delivered on time and
allowed the user to satisfy his operational needs. Although not a very
efficient approach, the combination of validation, acceptance, system
integration, and operational testing was effective in the "catch up"
detection of errors that should have been detected at the routine level. A
word of warning, however: not all software systems are as visible in a
testing sense as the Project 3 batch mode, higher order language software.
To depend on high level test for some software systems, e.g., real time
systems, could be a fatal mistake if the nature of the software being tested
precludes detection of errors through introduction of debug techniques.

*Total source statements include executable as well as non-executable
statements, but no comments.
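
The error rates quoted in this section can be combined in a short worked
calculation. This is an illustrative sketch only: the variable names are
assumptions, and the derived values are inferences that hold only if the
preoperational and operational rates refer to software of the same size.

```python
# Rates quoted in the text, per thousand total source statements.
operational_rate = 1.7
preoperational_rate = 17.5

# Parenthetical machine-language figures quoted alongside them.
operational_ml_rate = 0.61
preoperational_ml_rate = 6.3

# Fraction of all recorded code change errors found before operational use.
found_before_delivery = preoperational_rate / (
    preoperational_rate + operational_rate
)

# Implied machine-language instructions per source statement, computed
# from each pair of figures as a consistency check.
expansion_operational = operational_rate / operational_ml_rate
expansion_preoperational = preoperational_rate / preoperational_ml_rate

print(round(found_before_delivery, 3))
print(round(expansion_operational, 2), round(expansion_preoperational, 2))
```

Both pairs of figures imply nearly the same expansion from source
statements to machine instructions, which suggests the two sets of rates
were computed on a common basis.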

4.7.4 Evaluation of Project 5 Test Techniques

Although comparison of projects is not intended here, Project 5
presents an opportunity to examine the success of detailed routine level
testing in a development environment where use of some of the tools and
techniques described earlier is a reality.*

For  the  portion  of  the  Project  5 system  described  as  the  applications 
software,  there  is  a routine  level  test  requirement  that  not  only  all 
branches  be  exercised  but  all  paths  as  well.**  Project  design  and  coding 
standards  make  this  latter  requirement  realizable.  In  particular,  routines 
can  be  no  larger  than  100  executable  statements  in  length,  and  structural*** 
simplicity  is  encouraged  to  lower  execution  times. 

Although the software is being developed in a top-down, incremental
approach, testing within each increment is a combination of top-down and
bottom-up. Testing is done by the developers at the routine level using
a dynamic path analyzer tool (PACE) to aid in accomplishing the test
requirements listed above. Other test objectives include testing to
functional capabilities, testing with data singularities and extremes, and
finally, integration of routines into functional groups called tasks.
Functional capabilities are traceable to specific paths through the software
and to the software requirements. An important feature about Project 5 is
that time has been scheduled for this very detailed testing and tools have
been developed to aid in the test process.

Once  the  developers  complete  routine  level  and  task  integration  test- 
ing, the  software  is  turned  over  to  an  independent  test  organization  for 
process  integration,  during  which  the  software  plays  against  the  real-time 
simulator  and  under  control  of  a real-time  operating  system.  It  is  in 
this test phase that problem reports are created; these form the data
being used in this study.****

Using 273 problem reports generated during process integration testing
of the real-time applications software, an attempt was made to determine
if the routine level test strategy is effective in detecting errors it was
designed to detect. Results were divided into three categories: 1) errors
that were found where they should have been found, 2) errors where earlier
detection was debatable, and 3) errors that slipped through routine level
testing and were discovered in process level integration testing. Results
are presented in Table 4-35 for each of the Project 5 major error
categories and for the total sample of problem reports.

*Comparison to Project 3 results is tempting, but the differences between
the two projects make this unwise.

**Where a path through a loop is defined so that it includes at least one
traversal of a loop.

***Standards also require that the design and code be structured.

****Formal validation and operational performance testing will eventually
be done at the end of the top-down development cycle.

The "debatable" classification comes about because of the extreme
importance  of  having  the  simulator  available  to  detect  some  types  of 
errors.  That  is,  to  detect  these  errors  earlier  might  be  possible,  but 
generally  impractical.*  For  this  reason  the  percentage  of  debatable 
errors  best  fits  in  with  the  errors  that  were  detected  when  they  should 
have  been  detected. 

Results show a significant improvement over Project 3 routine level test
effectiveness, with only 15.7 percent of all the errors slipping through.


Table 4-35. Effectiveness of the Project 5 Test Strategy

                                         Errors Found When They
                                         Should Have? (Percent)
Project 5 Major
Error Categories          Occurrences    Yes     Debatable    No

Computational (A)              42        62.0      19.0      19.0
Logic (B)                      52        48.1      23.1      28.8
Data Input/Output (C/E)        21        76.2       9.5      14.3
Data Handling (D)              29        62.1      20.7      17.2
Interface (F)                  27        96.3       0         3.7
Data Definition (G)            19        47.4      36.8      15.8
Data Base (H)                  54        79.6       9.3      11.1
Other (J)                      29        65.5      27.6       6.9

Totals                        273        66.7      17.6      15.7
* The scope of this study didn't allow individual resolution of each of
these problem reports.

Looking at individual error types, we see that the logic errors had the
highest slippage at 28.8 percent and interface errors had the lowest
slippage at 3.7 percent. Computational and data handling categories were
next highest with 19.0 percent and 17.2 percent, respectively. The relatively
low slippage of data base errors stresses the importance of integration
testing in data base tuning for Project 5.
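The overall slippage figure can be cross-checked against the category rows of Table 4-35. The following is a minimal sketch in Python, using the occurrence counts and the "No" percentages from the table; the dictionary layout and variable names are illustrative assumptions, not part of the study.

```python
# Occurrences and "No" (slipped through) percentages copied from Table 4-35.
# The dictionary layout and variable names are illustrative assumptions.
table_4_35 = {
    "Computational (A)": (42, 19.0),
    "Logic (B)": (52, 28.8),
    "Data Input/Output (C/E)": (21, 14.3),
    "Data Handling (D)": (29, 17.2),
    "Interface (F)": (27, 3.7),
    "Data Definition (G)": (19, 15.8),
    "Data Base (H)": (54, 11.1),
    "Other (J)": (29, 6.9),
}

total_reports = sum(count for count, _ in table_4_35.values())

# Estimated number of slipped errors per category, summed over categories.
slipped = sum(count * pct / 100.0 for count, pct in table_4_35.values())

# Occurrence-weighted slippage percentage for the whole sample.
overall_slippage_pct = 100.0 * slipped / total_reports

print(total_reports, round(overall_slippage_pct, 1))
```

The weighted figure agrees with the Totals row (273 reports, 15.7 percent), which suggests the category rows and the total were computed from the same sample.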

Detailed error categories that slipped past routine level development
testing were the following:

Computational Errors

A_100  Incorrect operand in equation
A_600  Incorrect/inaccurate equation used
A_800  Missing computation

Logic Errors

B_100  Incorrect operand in logical expression
B_300  Missing logic or condition test
B_600  Loop iterated incorrect number of times
       (including endless loop)

Data I/O Errors

C_300  Incorrect input format
E_500  Incomplete or missing output

Data Handling Errors

D_100  Data initialization not done
D_200  Data initialization done improperly
D_500  Bit manipulation done incorrectly

Interface Errors

F_500  Software/data base interface error

Data Definition Errors

G_100  Data not properly defined/dimensioned

Data Base Errors

H_100  Data not initialized in data base
H_200  Data initialized to incorrect value

Other Errors

J_900  Software not compatible with project standards

Evidence of the improved effectiveness of Project 5 development and
routine level test strategies is also seen in the absence of certain types
of errors in the integration test data. For instance, errors in calculating
indices, entry numbers, and total number of entries (representing
26.3 percent of the Project 3 computational errors or 3.2 percent of all
errors) are virtually nonexistent in the Project 5 data. Errors associated
with reading and writing data in the wrong location are also virtually
nonexistent for Project 5; Project 3 errors of this type represented
12 percent of the data handling errors or 2.2 percent of all errors. It is
felt that these types of errors are being detected in the routine level
testing.

At  the  same  time  some  error  types  continue  to  be  predominant  for  both 
projects,  e.g.,  missing  logic  and  data  initialization  errors.  A summary 
of  percentages  for  these  categories  is  presented  below  in  Table  4-36. 


Table 4-36. Sample Error Type Similarities Between Projects 3 and 5

                                   Project 3                Project 5*
                             Percent of  Percent of   Percent of  Percent of
                             Major       All          Major       All
Error Category               Category    Errors       Category    Errors

Missing Logic Errors           47.6        12.3         46.8         8.0
Data Initialization Errors     43.1         7.7         26.7         2.9



Although this type of comparison is tempting, too many unknowns exist
to pursue further investigation. For example, one project is operational
while the other is still in development, i.e., Project 5 data are not
complete. Also, the data being compared are from different types of testing.
Any future comparisons should be made using controlled experimental data.
However, based on software size and the apparent volume of documented
errors alone, Project 5 should, at completion, compare very favorably to
similar measures taken on other projects.


* Real-time applications software only, but other portions of the Project 5
software exhibit high percentages in these categories as well.

4.7.5 Conclusions

The  following  conclusions  are  drawn  from  the  foregoing  examination 
of  software  errors  and  their  susceptibility  to  elimination  by  various 
tools  and  techniques. 

• To be of any practical use in developing or evaluating tools
and techniques, errors have to be categorized in considerable
detail.

• Detailed error categorization can provide groundwork for
specification of requirements for tools. Once the characteristics
of errors most frequently encountered are defined, it
is a short step to a list of capabilities a tool must have.

• Error data may be used to justify introduction of new techniques
or to direct application of existing techniques in
areas known to need attention.

• Although the analysis done here encompassed error data documented
during formal subsystem and system testing, the
approach used should apply equally well to error data collected
during other phases of the software development process,
including requirements specification, design, coding,
and maintenance phases.

Quantitative results given here should be viewed as peculiar to one
development  environment.  Although  major  categories  used  in  Projects  2 and  3 
are  quite  similar  to  those  used  in  Project  5,  the  detailed  characteristics 
of  errors  are  quite  different  owing  to  the  differences  in  operating  mode 
(batch  versus  real-time),  language  (JOVIAL  versus  FORTRAN),  and  development 
strategy  (one  time,  everything  at  once  versus  top  down),  just  to  name  a few. 

The  message  here  is  that  software  quality  (and  reliability)  can  be 
improved  through  increased  attention  to  detail  and  discipline  in  application 
of  development  and  test  techniques.  Tools  not  only  aid  in  accomplishing 
tasks  which  are  normally  done  manually  but  make  it  possible  to  accomplish 
tasks  which  would  not  be  done  at  all  if  the  only  way  possible  was  manual. 
Tools are possibly the only way to achieve the needed discipline and
attention to detail within the constraints of current software development
schedules.


* This list of capabilities should still be subject to a critique by the
eventual users.

5.0 SOFTWARE RELIABILITY MODELS

5.1  Introduction 

This  section  of  the  report  describes  proposed  models  of  software 
reliability  as  published  in  the  open  literature  and  government-sponsored 
reports.  Also  some  recent,  previously  unpublished,  progress  in  this  area 
by  TRW  is  presented  (5.2.4). 

The term "software reliability model" mainly refers to a mathematical
model constructed for the purpose of assessing the reliability of
software from specified parameters which are either assumed known or are
measured from observations or experiments on software. It can therefore
also refer to the mathematical relationships among the parameters which
are considered relevant to software reliability, and not specifically
include the reliability parameter itself. For example, the relationship
of a logic path to the input data subset exercising that logic path is
considered relevant to measurement of reliability, but can be evaluated
independently of a reliability measurement. Error rate is another related
parameter which has useful meaning in a real time continuously operating
system, but which is related to software reliability only indirectly (e.g.,
by assuming an exponential distribution of time between failures).

Another type of software reliability model to be discussed in
this section, for which only a statistical relationship to reliability
would be assumed, is the so-called phenomenological or empirical model. In
this approach, an attempt is made to quantitatively evaluate those
characteristics of software which are sensed as associated with high
reliability or the lack of it. For example, "complexity" labels what is
considered to be a characteristic leading to low reliability, in that
error-proneness (on the part of the programmer) or difficulty of finding and
removing errors are believed to be two of many consequences of "complexity".
In other terms, the phenomenological model of software reliability attempts
to deal with those parameters which when appropriately modified will tend
to improve software reliability. Section 4.3 discusses some of the studies
and progress made in use of this approach to software reliability.

Software generally can be considered as a subsystem of a computer
system. Avizienis [16] presents a clear discussion of this relationship
in terms of providing system design features to assure correct execution
of programs in the presence of faults in the computer system. He goes on
in [16] to define software faults as "deviations from correct program
execution which are due to errors* occurring during the translation of the
original specification of an algorithm to the program being executed".
Fault tolerance can be designed into a computer system through the use of
redundant hardware or software features. Software can be designed so that
certain classes of software faults can be reduced in frequency or eliminated.
The latter is subject matter for software reliability engineering, and it
is hoped that the current study will contribute to this objective.

Having accepted the definition of "software faults" given above,
we can now attempt a definition of software reliability: the probability
that a software fault which causes deviation from required output by more
than specified tolerances, in a specified environment, does not occur
during a specified exposure period.

Let us now discuss this definition in some detail.

There is the implication that not all software faults occurring
will reduce the reliability, only those which cause a "deviation from
required output by more than specified tolerances."

The specified environment refers to the input data description,
which will subsequently be discussed more thoroughly, and the state of the
computer system during program execution. The latter environment will
often be described by the amount of fast access storage (core) available,
but also will depend on the requirement for the software to operate in the
presence of hardware faults. Here we are referring to features incorporated
in the software design (e.g., alternate programs, capability of checkpoint
restart, etc.) which must work correctly when specified hardware faults
occur. In general, operating in an environment not included in the
specifications, and for which the software was not designed, will cause a
reduction in reliability.

* In the following discussion the words "error" or "error type" are
defined as a cause of a software "fault"; conversely the fault is
considered as the manifestation of an error made previously.

The specified exposure period is often referred to as the "mission
time." The concept of a time interval to measure software performance
seems to be most appropriate to a real time situation, in which the number
of executions of any given program and the state of the data base are
unpredictable as of each initiation of a program execution. The term
"operational cycle" or "run" is most apt for a so-called batch environment in
which the state of the system, including the data base, is known at the
initiation of execution of a given program. In general, this will mean
either a return of all storage to its original content prior to re-execution,
or a series of executions which changes the data base in a predetermined
manner.


5.2 Discussion of Models


The discussion of well-documented software reliability models,
such as those attributed to Shooman, Jelinski and Moranda, Wolverton and
Schick, and others, is kept to a minimum in this report. Hopefully, all
of the essential characteristics are described so that the significance
and relationship to this study can be ascertained without going to the
original reference.

5.2.1 Shooman's Model [17]

The application of this model for reliability estimation depends
upon several assumptions, the most significant of which is the requirement
to use a "system exerciser program", which is not completely defined in
[17]. The remainder of the assumptions are statistical in nature and do
not reflect unique properties of software. These latter assumptions as
given in [17] can be briefly listed as follows:


A1. At the start of system integration, there are $E_T$ errors
present in the software. Beginning at this point, debugging
time $\tau$ is counted, which is the time spent in test
ferreting out errors, checkout, etc., exclusive of operating
time $t$. During the debugging time $\tau$, a number $\epsilon_c(\tau)$ of
errors per machine language instruction are removed. Thus
the number of errors per machine language instruction remaining
after $\tau$ months of debugging is:

$$\epsilon_r(\tau) = E_T/I_T - \epsilon_c(\tau)$$

where $I_T$ is the total number of machine language instructions
(assumed constant).

A2. It is then assumed that the hazard function $z(t)$ is proportional
to the number of errors remaining in the software
after debugging time $\tau$, i.e.:

$$z(t) = C\,\epsilon_r(\tau)$$

and if $t$ operational hours are then counted from the point
at which $t = 0$, and $\tau$ stays fixed, then the reliability
function, or probability of no failure during the operation
time interval $(0,t)$, is:

$$R(t,\tau) = \exp\left(-C\left(E_T/I_T - \epsilon_c(\tau)\right)t\right)$$
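
To make the roles of the parameters concrete, the reliability function of assumption A2 can be evaluated numerically. This is a minimal sketch in Python; the function name, argument names, and all numerical values are hypothetical and chosen only for illustration.

```python
import math


def shooman_reliability(t, e_total, i_total, eps_corrected, c):
    """Evaluate R(t, tau) = exp(-C * (E_T / I_T - eps_c(tau)) * t).

    t             -- operating time, counted from t = 0
    e_total       -- E_T, errors present at the start of integration
    i_total       -- I_T, number of machine language instructions
    eps_corrected -- eps_c(tau), errors per instruction removed so far
    c             -- proportionality constant C of the hazard function
    """
    eps_remaining = e_total / i_total - eps_corrected  # eps_r(tau)
    if eps_remaining < 0:
        raise ValueError("more errors corrected than were assumed present")
    hazard = c * eps_remaining  # z(t), constant while tau stays fixed
    return math.exp(-hazard * t)


# Hypothetical values: 100 errors in 10,000 instructions, 0.008 errors per
# instruction already corrected, C = 50 per hour, one operating hour.
r = shooman_reliability(t=1.0, e_total=100, i_total=10_000,
                        eps_corrected=0.008, c=50.0)
print(r)
```

With these assumed values the hazard is 50 x 0.002 = 0.1 per hour, so the sketch evaluates exp(-0.1) for one hour of operation. Further debugging raises eps_c(tau) and therefore raises R for the same operating time.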


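If $k$ tests are made following periods $(0,\tau_1), (0,\tau_2), \ldots,
(0,\tau_k)$ of debugging, where $\tau_1 < \tau_2 < \cdots < \tau_k$, it can be
shown that if $k \ge 2$, two equations for the maximum likelihood
estimators $\hat{C}$, $\hat{E}_T$ of the two parameters $C$, $E_T$ exist:

$$\hat{C} = \frac{\sum_{j=1}^{k} n_j}{\sum_{j=1}^{k} \left(\hat{E}_T/I_T - \epsilon_c(\tau_j)\right) H_j}$$

$$\hat{C} = \frac{\sum_{j=1}^{k} n_j \big/ \left(\hat{E}_T/I_T - \epsilon_c(\tau_j)\right)}{\sum_{j=1}^{k} H_j}$$

where $n_j$ = number of runs terminating in failure in the $j$th test
      $H_j$ = total time of successful and unsuccessful runs in the $j$th test

The asymptotic (for large $n_j$) variances $\mathrm{Var}(\hat{C})$ and
$\mathrm{Var}(\hat{E}_T)$ and the correlation coefficient
$\rho(\hat{C},\hat{E}_T)$ are also given (the right-hand sides of these
three expressions are not legible in the source).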


The usefulness of the asymptotic variances and correlation
coefficient is in the construction of a confidence region for the parameters
$C$, $E_T$ based upon normal distribution theory.


Regarding the system exerciser program, it is not emphasized in
[17] that this program must present to the subject software input data which
represents the operational situations. We will term it an "operational
profile", and it is defined essentially by the probability distribution
of the input data variables. More will be said on this point in 5.2.4.1
and 5.2.4.2, where some recent work by TRW on a mathematical theory of
software reliability is discussed.

5.2.2 The Jelinski-Moranda Model [18]

This software reliability prediction model is actually a special
case of the Shooman model, as pointed out in [17]. It can be expressed
directly as follows. Assume that the amount of debugging time (Shooman's
nomenclature) between error occurrences has an exponential distribution,
with error occurrence or failure rate (hazard function) proportional to the
number of errors remaining in the software. Each error discovered is
immediately removed from the software, decreasing the number of errors
remaining by one. Consequently, the density function for the time of
discovery of the $i$th error, measured from the time of discovery of the
$(i-1)$st error, is:

$$p(t_i) = \lambda_i e^{-\lambda_i t_i}$$

where

$$\lambda_i = \phi\,(N - i + 1)$$

and where $N$ is the number of errors originally present. Based upon a
sequence of observations $t_1, t_2, \ldots, t_k$, it is shown in [19] that the
maximum likelihood estimators for $N$ and $\phi$ as well as their asymptotic
variances and correlation coefficient are given by the following formulas.
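
To find the maximum likelihood estimators, labeled $\hat{N}$, $\hat{\phi}$,
solve the two equations:

$$\sum_{i=1}^{k} (\hat{N} - i + 1)^{-1} = \frac{k}{\hat{N} + 1 - \theta k}$$

$$\hat{\phi} = \frac{k/A}{\hat{N} + 1 - \theta k}$$

where

$$\theta = B/(Ak), \qquad A = \sum_{i=1}^{k} t_i, \qquad B = \sum_{i=1}^{k} i\,t_i$$

The asymptotic variances and correlation coefficient are given by:

$$\mathrm{Var}(\hat{N}) \approx \frac{k}{\phi^2 D}$$

$$\mathrm{Var}(\hat{\phi}) \approx \frac{S_2}{D}$$

$$\rho(\hat{N},\hat{\phi}) \approx -\frac{A\phi}{[k S_2]^{1/2}}$$

where

$$D = k S_2/\phi^2 - A^2, \qquad S_2 = \sum_{i=1}^{k} (N - i + 1)^{-2}$$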

To obtain numerical values of the above quantities, the numerical
values of the estimators $\hat{N}$, $\hat{\phi}$ are substituted wherever
$N$, $\phi$ occur. A detailed example is shown in [19].
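
The pair of likelihood equations for the Jelinski-Moranda model has no closed-form solution, but the first equation involves only N and can be solved numerically. The following Python sketch uses simple bisection; the function name, the bisection approach, and the sample data are assumptions made for illustration and are not taken from [19]. The estimate of N is treated as a continuous quantity.

```python
def jm_mle(times):
    """Solve the Jelinski-Moranda likelihood equations by bisection.

    times -- t_1, ..., t_k, the times between successive error discoveries.
    Returns (n_hat, phi_hat), with n_hat treated as a continuous value.
    """
    k = len(times)
    a = sum(times)                                      # A = sum of t_i
    b = sum(i * t for i, t in enumerate(times, start=1))  # B = sum of i * t_i
    theta_k = b / a                                     # theta * k

    # A finite root exists only if the times tend to grow (reliability growth).
    if theta_k <= (k + 1) / 2:
        raise ValueError("no finite estimate: data show no reliability growth")

    def g(n):
        # Left side minus right side of the first likelihood equation.
        lhs = sum(1.0 / (n - i + 1) for i in range(1, k + 1))
        return lhs - k / (n + 1 - theta_k)

    lo = k - 1 + 1e-9   # g is large and positive just above k - 1
    hi = float(k)
    while g(hi) > 0:    # expand until the sign changes
        hi *= 2
    for _ in range(200):
        mid = (lo + hi) / 2
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    n_hat = (lo + hi) / 2
    phi_hat = (k / a) / (n_hat + 1 - theta_k)
    return n_hat, phi_hat


# Hypothetical inter-discovery times showing mild reliability growth.
times = [10, 12, 11, 14, 13, 16, 15, 18]
n_hat, phi_hat = jm_mle(times)
print(n_hat, phi_hat)
```

The guard on theta reflects the fact that, when the inter-discovery times do not tend to lengthen, the first equation has no finite solution; the sketch raises an error in that case rather than returning a misleading value.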


It is worthwhile to note the modification in the Jelinski-Moranda
model suggested by Wolverton and Schick [20]. The modification is
to assume that the error rate is proportional not only to the number of
errors present but also to the time spent in debugging, so that the chance
of discovery increases as time goes on.


5.2.2.1 Generalization of the J-M and S-W Models


Both the Jelinski-Moranda and Schick-Wolverton models may be
generalized  by  allowing  more  than  one  error  to  occur  in  a time  interval 
of  interest,  with  no  corrections  being  made  until  after  the  end  of  the  time 
interval. 


The likelihood function is slightly different from that of
Reference [19], as the observations are now the number of errors occurring in
prespecified time intervals, rather than the waiting times to each of the
observed errors. All the errors occurring in a specified time interval
are then assumed to be removed before resuming testing. Nevertheless, the
equations for the solutions of the estimators $\hat{\phi}$ and $\hat{N}$ are
quite similar. The details of the derivation are not given here, only the
results; the changes to the equations given previously and the
reinterpretation of parameters where necessary are pointed out.


The two equations to solve for $\hat{N}$ become:

$$\sum_{i=1}^{M} \frac{M_i}{\hat{N} - n_{i-1}} = \frac{K}{\hat{N} + 1 - K\theta}$$

and

$$\hat{\phi} = \frac{K/A}{\hat{N} + 1 - K\theta}$$

where

$$A = \sum_{i=1}^{M} t_i \quad \text{(Jelinski-Moranda)}$$

$$A = \sum_{i=1}^{M} t_i\,(T_{i-1} + t_i/2) \quad \text{(Schick-Wolverton*)}$$

$$B = \sum_{i=1}^{M} (n_{i-1} + 1)\,t_i \quad \text{(Jelinski-Moranda)}$$

$$B = \sum_{i=1}^{M} (n_{i-1} + 1)\,t_i\,(T_{i-1} + t_i/2) \quad \text{(Schick-Wolverton*)}$$

and

$t_i$ = Length of time interval in which $M_i$ errors are observed

$T_{i-1}$ = Cumulative time through the $(i-1)$st interval
          = $\sum_{j=1}^{i-1} t_j$, with $T_0 = 0$

$n_{i-1}$ = Cumulative number of errors observed up through
          the $(i-1)$st time interval
          = $\sum_{j=1}^{i-1} M_j$, with $n_0 = 0$

$M$ = Total number of time intervals

$K$ = $\sum_{i=1}^{M} M_i$ = $n_M$ = total number of errors observed

$\theta$ = $B/(AK)$

The two equations to solve for $\hat{\phi}$, $\hat{N}$ reduce to their
previous counterparts when $M_i = 1$. In that case $M = K$ ($= k$ in
Reference [19]). Also $n_{i-1}$ then becomes simply $i-1$. The only
remaining change is that

$$S_2 \text{ becomes } \sum_{i=1}^{M} \frac{M_i}{(N - n_{i-1})^2}$$

* The primary difference between this and the model of Reference [20] is
that here it is assumed that the error discovery rate $\lambda_i$ is constant
during the time interval $t_i$, so that

$$\lambda_i = \phi\,(N - n_{i-1})\,(T_{i-1} + t_i/2)$$

i.e., $\lambda_i$ is proportional to the number of errors remaining following
the $(i-1)$st time interval, but also proportional to the total time
previously spent in testing (including an "averaged" error search time during
the current time interval $t_i$).
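
The quantities A, B, K and theta for the interval-count form of the two models can be accumulated in a single pass over the data. The sketch below is in Python and assumes the following reading of the partly garbled source: A is the sum of t_i (Jelinski-Moranda) or of t_i (T_{i-1} + t_i/2) (Schick-Wolverton), and B is the same sum with each term weighted by (n_{i-1} + 1). The function and argument names are hypothetical.

```python
def interval_sums(lengths, counts, schick_wolverton=False):
    """Accumulate A, B, K and theta for interval-count error data.

    lengths -- t_i, the length of each test interval
    counts  -- M_i, the number of errors observed in each interval
    schick_wolverton -- if True, weight each interval by (T_{i-1} + t_i / 2)
    """
    a = 0.0
    b = 0.0
    t_prev = 0.0  # T_{i-1}, cumulative time before the current interval
    n_prev = 0    # n_{i-1}, cumulative errors before the current interval
    for t_i, m_i in zip(lengths, counts):
        weight = t_i * (t_prev + t_i / 2) if schick_wolverton else t_i
        a += weight
        b += (n_prev + 1) * weight
        t_prev += t_i
        n_prev += m_i
    k_total = n_prev  # K = n_M, total number of errors observed
    theta = b / (a * k_total)
    return a, b, k_total, theta


# With one error per interval the sums reduce to A = sum(t_i), B = sum(i * t_i).
print(interval_sums([10, 12, 11], [1, 1, 1]))
# Hypothetical Schick-Wolverton data: two intervals with 1 and 2 errors.
print(interval_sums([2, 4], [1, 2], schick_wolverton=True))
```

The first call illustrates the reduction to the single-error case, where n_{i-1} + 1 is simply i.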


5.2.3 Other Models

Two models which were formulated to deal with the problem of
predicting reliability growth in complex systems are briefly described here.
They were chosen from among many for various reasons, primarily for
simplicity of assumptions, but also for variety of other factors considered,
and because they are considered more or less an original contribution.
Finally, some of the concepts expressed and developed should have useful
application to software reliability predictions.

5.2.3.1 The Weiss Model [21]

This model (one of three described in [21]), besides assuming an
exponential distribution of time-to-failure, considers that there are $M$
sources of failure at the beginning of a development process. Trials are
made (at unspecified points in time), each lasting time $T_j$, and success or
the time of failure is recorded. The failure rate is assumed to be
different for each source of failure and is expressed by an a priori
probability distribution of failure rates. When a failure occurs on a given
trial there is a probability $p_c \le 1$ that the failure is corrected. Most
of the consequences of these assumptions are worked out in some simple
examples, the primary outcome being the predicted reliability function
following the removal of errors in a given number of trials. The author
observes that the mean time to failure can, for the complex model just
described, be fitted by an exponential function of trial number, showing
a constant percentage increase with trial number.

The model described may be useful to evaluation of software
reliability if it and the assumptions are modifiable as follows: Each of the
$M$ sources of failure represents a software error type, e.g., "specified
array size not large enough," which may have several occurrences throughout
the software. For some error types, such as the above example, once the
error is discovered, there is a high probability that all other occurrences
of this error type will be removed at once. For other error types, a
relatively small fraction of its occurrences elsewhere in the software will
be discoverable and corrected once discovered. This fraction corresponds
to the parameter $p_c$ in the described model except that its value would
generally be different for each error source and would be estimated from
previous data.

As with all reliability prediction models conceived originally
in terms of hardware performance, for software the "test time" must be
associated with the degree of representativeness of input data. That is
to say, testing for a long time using only a limited region of the input
data space will result in unbiased reliability predictions only if the
probability of the remaining region of input data being presented to the
program is essentially zero. This point was made earlier in reference
to the system exerciser program mentioned in [17].

5.2.3.2 Models Proposed by Corcoran, et al. [22]

The referenced models are mentioned here because they deal with
varying probabilities of failure for different error sources and also
correspondingly varying probabilities of corrective action. Another
reason is that they are relatively simple mathematically and seem capable
of simple interpretation to software error data. The models ignore time
of test and merely consider the outcome of $N$ trials, in which $N_i$ errors
of the $i$th type (failure source) are observed. Following the $N$ trials,
the $i$th observed error type is corrected with probability $a_i$. As a result
of this process, the reliability can be inferred to be larger than the
observed success ratio by an amount depending upon the occurrence of
errors and their probabilities of removal. One of the reliability
estimators, labeled as $\hat{p}_7$ in [22], is:

$$\hat{p}_7 = \frac{N_0}{N} + \sum_{i=1}^{K} y_i\,\frac{N_i - 1}{N}$$

where $N_0$ is the number of successful trials in $N$ trials, $K$ is the a
priori known number of error types, and

$$y_i = \begin{cases} a_i & \text{if } N_i > 0 \\ 0 & \text{if } N_i = 0 \end{cases}$$

It is shown in [22] that $\hat{p}_7$ is unbiased asymptotically for large $N$,
and its variance approaches zero for large $N$. An exact expression for the
expected value of $\hat{p}_7$ and an approximate expression for the variance
of $\hat{p}_7$ are also shown.
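
The estimator can be evaluated directly from trial counts. The following Python sketch assumes the reading given above, in which y_i equals the correction probability a_i for every error type that was actually observed and zero otherwise; the function name and all numbers are hypothetical.

```python
def corrected_reliability_estimate(n_trials, n_success, failures, fix_probs):
    """Success ratio adjusted for expected corrective action.

    n_trials  -- N, total number of trials
    n_success -- N_0, number of successful trials
    failures  -- N_i, observed failure count for each of the K error types
    fix_probs -- a_i, probability that error type i is corrected
    """
    if n_success + sum(failures) != n_trials:
        raise ValueError("successes and failures must add up to N")
    estimate = n_success / n_trials
    for n_i, a_i in zip(failures, fix_probs):
        y_i = a_i if n_i > 0 else 0.0  # unobserved types contribute nothing
        estimate += y_i * (n_i - 1) / n_trials
    return estimate


# Hypothetical data: 100 trials, 90 successes, three error types of which
# the third was never observed.
est = corrected_reliability_estimate(
    n_trials=100, n_success=90, failures=[6, 4, 0], fix_probs=[0.9, 0.5, 0.8])
print(est)
```

In this made-up case the raw success ratio is 0.90, and the adjustment adds (0.9 x 5 + 0.5 x 3) / 100 = 0.06. Setting y_i to zero for unobserved types also prevents the (N_i - 1) factor from contributing a negative term.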

Again, in this model the $a_i$ must be estimated from prior
information or data, and to be useful for software reliability estimation,
each trial, or at least the collection of $N$ trials, must represent the
submission to the program of input data regions in accordance with the
operational profile.

5.2.4 The Nelson Model [23]

The Nelson model was developed at TRW as a mathematical theory
of software reliability (MTSR) to provide a foundation for investigating
software reliability, viz:

• precise, mathematical definitions of the basic elements of
software reliability,

• mathematical relations on these elements, and

• mathematical methods for manipulating the elements to derive
new aspects of software reliability.

The model is described in this section and compared with the other models.
Work performed on application and extension of MTSR is reported in Section
6.0.

5.2.4.1 Description of the Nelson Model

The intuitive definition of software reliability given in 5.1
can be changed into a precise, although statistical, definition by making
use of the following primitive concepts:

• A computer program $p$ may be defined [24] as a specification of
a computable function $F$ on the set $E$;

$$E = (E_i : i = 1, 2, \ldots, N)$$

of all input data values, each $E_i$ being the collection of
data values needed to make a run of the program.

• Execution of $p$ produces, for each input $E_i$, the function value
$F(E_i)$.

• The set $E$ defines all the computations which $p$ can make;
i.e., each input $E_i$ corresponds to a possible run of $p$ and
each run corresponds to an input $E_i$.

• Owing to imperfections in the implementation of $p$, $p$ actually
specifies a function $F'$, which differs from $F$, the function
the program is intended to specify.

• For some $E_i$, the deviation of the actual execution output
$F'(E_i)$ from the desired output $F(E_i)$ is within an acceptable
tolerance $\Delta_i$; i.e.,

$$|F'(E_i) - F(E_i)| \le \Delta_i$$

• For all other $E_i$, which form a subset $E_e$ of $E$, execution of $p$
does not produce acceptable output; i.e., either:

  • $|F'(E_i) - F(E_i)| > \Delta_i$, or

  • execution terminates prematurely, or

  • execution fails to terminate.

Such occurrences are called "execution failures".

Each $E_i$ is a possible combination of the values which can be
assigned to the input variables (the variables whose values must be
presented to $p$ to enable $p$ to execute). The number $N$ of possible $E_i$
is very large, but it is finite because only a finite number of different
values can fit in a fixed-sized computer word.

The process of presenting $E_i$ to $p$ and executing $p$ to produce
the output $F'(E_i)$ or an execution failure is called a run of $p$. Note that
all of the values that compose $E_i$ do not have to be presented to $p$
simultaneously.

The probability $P$ that a run of $p$ will result in an execution
failure is therefore equal to the probability that the input $E_i$ used
in the run will be chosen from $E_e$. If $n_e$ is the number of $E_i$ in
$E_e$, then:

$$P = \frac{n_e}{N}$$

is the probability that a run of $p$ with input $E_i$ selected from $E$ at
random with equal a priori probability will result in an execution failure and

$$R = 1 - P = 1 - \frac{n_e}{N}$$

is the probability that a run of $p$ with input $E_i$ selected from $E$ at
random with equal a priori probability will produce acceptable output.

In operational use of a program, however, the inputs are not usually selected from E with equal a priori probability. Rather, they are selected according to some operational requirement. This requirement may be characterized by a probability distribution $p_i$, $p_i$ being the probability that $E_i$ is selected. The set of $p_i$'s is called the "operational profile". P may be expressed in terms of $p_i$ by defining an "execution variable" $y_i$, which is assigned the value 0 if a run with $E_i$ computes an acceptable function value, and which is assigned the value 1 if a run with $E_i$ results in an execution failure. Then

$$P = \sum_{i=1}^{N} p_i y_i$$

is the probability that a run of p with input $E_i$ chosen according to the probability distribution $p_i$ will result in an execution failure, and

$$R = 1 - P = \sum_{i=1}^{N} p_i (1 - y_i)$$

is the probability that a run with input chosen according to the probability distribution $p_i$ will result in correct execution.
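The profile-weighted definitions of P and R can be checked on a toy input set. The following is a minimal sketch only; the function names and the three-input profile are illustrative assumptions and do not come from the study.

```python
def failure_probability(profile, y):
    """P = sum_i p_i * y_i, where y_i = 1 marks an input that fails."""
    if abs(sum(profile) - 1.0) > 1e-9:
        raise ValueError("operational profile must sum to 1")
    return sum(p * yi for p, yi in zip(profile, y))


def reliability(profile, y):
    """R = sum_i p_i * (1 - y_i), which equals 1 - P."""
    return sum(p * (1 - yi) for p, yi in zip(profile, y))


# Hypothetical program with three possible inputs; only the third one fails.
profile = [0.7, 0.2, 0.1]
y = [0, 0, 1]
P = failure_probability(profile, y)
R = reliability(profile, y)
```

With equal a priori probabilities the same functions reduce to the earlier formula $P = n_e/N$; with the skewed profile assumed here, the failing input is chosen less often and R is correspondingly higher.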

Since R is the probability of p not resulting in an execution failure in a single run made with input selected according to $p_i$, the probability of there being no execution failures in n runs with each of the inputs selected independently according to $p_i$ is:

$$R(n) = R^n = (1 - P)^n$$

Thus a mathematical definition of the reliability of a computer program is:

• The probability that the program has no execution failure in n runs.

The unit of exposure of a program is therefore the run.

In operational use, the inputs for n runs usually are not selected independently but in a definite sequence, such as ascending values of some input variable, or in a sequence determined by some real input, as in the case of a real time program. Then the operational profile must be redefined to be $p_{ij}$, the probability that $E_i$ is chosen as the input to the j-th run in a sequence of runs. Then the probability $P_j$ that run j results in an execution failure can be written:

$$P_j = \sum_{i=1}^{N} p_{ij} y_i$$

The reliability R(n) of p is the probability that no execution failure occurs in a sequence of n runs:

$$R(n) = (1 - P_1)(1 - P_2) \cdots (1 - P_n) = \prod_{j=1}^{n} (1 - P_j)$$

This formula may be written in exponential form:

$$R(n) = \exp\left[\sum_{j=1}^{n} \ln(1 - P_j)\right]$$

Some of the properties of R(n) may be exhibited by making approximations:

• For $P_j \ll 1$:

$$R(n) \approx \exp\left[-\sum_{j=1}^{n} P_j\right]$$

• If $P_j = P$ for all j:

$$R(n) \approx e^{-Pn}$$

R(n) may be expressed in terms of the execution time t by making the substitutions:

• $\Delta t_j$ denotes the execution time for run j

• $t_j = \sum_{i=1}^{j} \Delta t_i$ denotes the cumulative execution time through run j

• $h(t_j) = -\dfrac{\ln(1 - P_j)}{\Delta t_j}$

so that

$$R(n) = \exp\left[-\sum_{j=1}^{n} \Delta t_j \, h(t_j)\right]$$

If $\Delta t_j$ is considered to approach zero as n becomes large, the sum in the exponential becomes an integral, producing the formula:

$$R(t) = \exp\left[-\int_0^t h(s)\, ds\right]$$

which is familiar from hardware reliability theory. In the case $P_j \ll 1$, $h(t_j)$ may be interpreted as the "hazard" function, which when multiplied by $\Delta t_j$ is the conditional probability of failure during the interval $[t_j, t_j + \Delta t_j)$ given no failure prior to $t_j$.
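The equivalence of the product form, the exponential form, and the hazard-function form of R(n) can be confirmed numerically. The sketch below is illustrative only: the per-run failure probabilities and run times are assumed values, and the function names are hypothetical.

```python
import math


def reliability_product(P):
    """R(n) = prod_j (1 - P_j) for per-run failure probabilities P_j."""
    r = 1.0
    for pj in P:
        r *= 1.0 - pj
    return r


def reliability_exponential(P):
    """The same quantity written as exp(sum_j ln(1 - P_j))."""
    return math.exp(sum(math.log(1.0 - pj) for pj in P))


def reliability_small_p(P):
    """Approximation exp(-sum_j P_j), reasonable when every P_j << 1."""
    return math.exp(-sum(P))


def hazard(pj, dt):
    """h(t_j) = -ln(1 - P_j) / dt_j for a run of duration dt_j."""
    return -math.log(1.0 - pj) / dt


P = [0.001, 0.002, 0.0015, 0.001]   # assumed per-run failure probabilities
dt = [0.5, 0.4, 0.6, 0.5]           # assumed run times (arbitrary units)
exact = reliability_product(P)
via_hazard = math.exp(-sum(d * hazard(p, d) for p, d in zip(P, dt)))
```

The first two forms and the hazard form agree to rounding error because they are algebraic rewrites of one another; only `reliability_small_p` is an approximation, and its error grows as the $P_j$ grow.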

5.2.4.2 Measurement of Software Reliability

The reliability of a computer program can be measured by running the program with a sample of n inputs and calculating $\hat{R}$, the measured value, from the formula

$$\hat{R} = 1 - \frac{n_e}{n}$$

where $n_e$ is the number of inputs for which execution failures occurred.

If the n inputs in the sample are chosen from E at random according to the probability distribution $p_i$, the measured value $\hat{R}$ is an unbiased estimate of R in the sense that the expectation of $\hat{R}$ over the sample probability distribution is equal to R, for $p_i \ll 1$. This can be shown by introducing a "sample variable" $z_{ij}$ which is defined so that:

$z_{ij} = 1$ if $E_i$ is in sample j
$z_{ij} = 0$ otherwise

The number of distinct inputs $n_j$ obtained in a sample j may be less than n, since the same $E_i$ can be selected more than once by the random process:

$$\sum_{i=1}^{N} z_{ij} = n_j$$

However, in most cases the number of possible inputs N is so much larger than the sample size that it is very unlikely that any repetition of inputs would occur. If $s_j$ is the probability that sample j is chosen and M is the number of possible samples:

$$\sum_{j=1}^{M} s_j z_{ij} = 1 - (1 - p_i)^n$$

Since $\hat{R}$ can be written in the form:

$$\hat{R}_j = \frac{1}{n} \sum_{i=1}^{N} (1 - y_i) z_{ij}$$

the expectation of $\hat{R}$, $E(\hat{R})$, is:

$$E(\hat{R}) = \sum_{j=1}^{M} s_j \hat{R}_j$$

$$= \frac{1}{n} \sum_{j=1}^{M} s_j \sum_{i=1}^{N} (1 - y_i) z_{ij}$$

$$= \frac{1}{n} \sum_{i=1}^{N} (1 - y_i) \sum_{j=1}^{M} s_j z_{ij}$$

$$= \frac{1}{n} \sum_{i=1}^{N} (1 - y_i) \left[1 - (1 - p_i)^n\right]$$

$$\approx \sum_{i=1}^{N} (1 - y_i) p_i \quad \text{for } p_i \ll 1$$

$$= 1 - P$$

$$= R$$

In order to carry out a measurement of R, the operational profile $p_i$ must be determined. In actual practice, this is done by dividing the ranges of the input variables into subranges and assigning probabilities that an input will be chosen from each subrange, based on an estimate of the occurrence of inputs in the operational use for which the reliability is being measured. The probabilities $p_i$ having been determined in this way, a sample of n inputs can be chosen at random (e.g., with the aid of a random number generator) in accordance with the $p_i$. When the n runs of the measurement are made, the outputs for some of the inputs will be found to be correct while execution failures may occur for other inputs. When an execution failure occurs, the measurement process should not be stopped and the error corrected. Instead the entire sample of n runs should be made without correcting any of the errors which caused the execution failures, and the value of $\hat{R}$ can then be calculated from the measurement data.

This approach to measuring software reliability was tested at TRW by applying it to two programs, each written by a different person from the same specifications. A profile $p_i$ was established to approximate what could be expected in operational use of the program. A sample of 1000 inputs was selected from this profile with the aid of a random number generator and 1000 runs made with each program. The output for each run was examined and the runs which resulted in execution failures were identified. Three execution failures were obtained for one of the programs and 35 for the other. The measured reliability, $\hat{R}$, was calculated to be .997 for the first program and .965 for the second. The program having the higher reliability had the simpler structure, supporting the notion that complex structure leads to reduced reliability [7].
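The measured values quoted for the two TRW programs follow directly from the formula for $\hat{R}$. The sketch below reproduces that arithmetic and shows one way a sample could be drawn from a profile; the helper names, the three-element input list and its weights are assumptions made for illustration, not the profile used in the experiment.

```python
import random


def measured_reliability(n, n_e):
    """R-hat = 1 - n_e / n for n runs with n_e execution failures."""
    if n <= 0 or not 0 <= n_e <= n:
        raise ValueError("need n > 0 and 0 <= n_e <= n")
    return 1.0 - n_e / n


def draw_sample(inputs, profile, n, seed=0):
    """Choose n inputs at random according to the operational profile."""
    rng = random.Random(seed)
    return rng.choices(inputs, weights=profile, k=n)


r_first = measured_reliability(1000, 3)     # program with 3 failures
r_second = measured_reliability(1000, 35)   # program with 35 failures
sample = draw_sample(["a", "b", "c"], [0.7, 0.2, 0.1], 10)
```

Note that `random.choices` samples with replacement, so the same input can appear more than once, which matches the remark above that the number of distinct inputs in a sample may be less than n.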


5.2.4.3 Comparison with Other Models

Both the Shooman model and the Jelinski-Moranda model use time as the unit of exposure and an exponential formula for R(t):

$$R(t) = e^{-ht}$$

The hazard function h is assumed constant during operational time and changes only when an error is found and removed, after which t must be reset to zero. Since this formula can be derived from the Nelson model by making certain approximations, the approximations required may be used to define the conditions under which the Shooman and Jelinski-Moranda models are valid; viz:

• t should be interpreted as cumulative execution time from a specified starting time;

• t should be large compared to the average execution time $\Delta t$ per run;

• the inputs for successive runs should be chosen at random according to a probability distribution approximating that expected in the type of use for which the reliability prediction is made.

Both  Shooman  and  Jelinski-Moranda  attempt  to  extend  their  models 
by  assuming  that  the  hazard  function  is  proportional  to  the  number  of 
errors  remaining  in  the  program  and  then  to  apply  their  models  to  the 
testing  of  programs. 

The Weiss model and the Corcoran models were developed to describe reliability improvement during testing and attempt to incorporate the effect of there being several sources of errors present in a program.

All of these models are limited in their applicability because they were not developed to correspond closely to the properties of programs and how they are used and tested. They rely principally on general principles of probability, such as the exponential dependence of reliability on exposure, and on simplistic assumptions about the effect of errors. Although the exponential dependence is well established, the other aspects of these models appear to be, at best, rough approximations to software properties.

The Nelson model was developed from the basic properties of computer programs and uses probability theory to deal with those aspects for which incomplete information is available (e.g., what input will be chosen in the next run). Where approximations are made, they are well-defined and the limits of their applicability are known. Because of its foundation on software properties, it can be extended systematically to describe in more detail other aspects of software reliability. Some of the extensions which have been made are described in Section 6.0. Because of its broader foundation and extensibility, it is referred to there as a mathematical theory of software reliability rather than a reliability model.


6.0 APPLICATION AND EXTENSION OF THE MATHEMATICAL THEORY
OF SOFTWARE RELIABILITY

The basic elements of the mathematical theory of software reliability (MTSR) are described in Section 5.2.4.* This section reports on work done in the Software Reliability Study to apply and extend the theory so that it has become a broad conceptual framework for exploring, comprehending, and analyzing the development and testing of reliable software, with mathematical tools for conducting quantitative analysis. MTSR was applied to:

• Analyze software reliability data (6.1)

• Investigate specific problems (6.2)

    • Input data set partitioning (6.2.1)

    • Reliability measurement uncertainty (6.2.2)

    • Effect of software error removal (6.2.3)

    • Effect of program structure (6.2.4)

    • Estimating reliability from test results (6.2.5)

    • Program safety effects (6.2.6)

• Develop improved techniques for writing reliable programs (6.3)

• Develop improved software testing methods (6.4)

6.1 Application of MTSR to Project 5 Test Data

Test data from several routines developed and tested on Project 5 were analyzed with the aid of MTSR. The nature of this analysis can be illustrated by showing how it was performed on one of the Project 5 routines, which will be called "routine A".

Routine A is written in FORTRAN and has 20 executable statements. Its code is:

      IF(GN.NE.0.) GOTO 10
      IF(CN.LT.CT) GOTO 5
      IE = 1
      GOTO 25
    5 IE = 0
      GOTO 25
   10 IF(CN.LT.TR) GOTO 20
      IE = 1
      GOTO 25
   20 IE = 0
   25 IF(IE.NE.1) GOTO 40
      JE = JE+1
      KI = JD
      KM = 2
      KR = 3
      KB = JA
      KE = JB
      JV = JV + KI + 1
      KG = 1
   40 RETURN
      END

*Some of the notation used in this section has been defined in Section 5.2.4.

The analysis of routine A began with identification of the input variables, their types and ranges. The variables to which values must be assigned in order for routine A to execute are:

• GN, CN, CT, TR : real data type

• JA, JB, JD, JE, JV : integer data type

For both data types, the variables range over the values which can be stored in a computer word.

The testing performed on routine A involved four test cases. The values assigned to the real variables were:

• Test Case 1:  GN = 0.   CN = 5.   CT = 4.   TR = 6.

• Test Case 2:  GN = 1.   CN = 8.   CT = 4.   TR = 6.

• Test Case 3:  GN = 0.   CN = 3.   CT = 4.   TR = 6.

• Test Case 4:  GN = 1.   CN = 5.   CT = 4.   TR = 6.

The values assigned to the integer data were the same for all 4 test cases and are not listed here.

Next the logic paths which are executed by each of the test cases were determined, as well as the set of inputs which cause each logic path to execute.

In test case 1, the value 0 assigned to GN causes the branch test (GN.NE.0.) in the first IF statement to be evaluated false, so execution proceeds to the second IF statement, where the value 5 assigned to CN and 4 to CT cause the branch test (CN.LT.CT) to be evaluated false. Next the internal variable IE is assigned the value 1 and execution transfers to the fourth IF statement, where the test (IE.NE.1) is evaluated false and execution continues down to the RETURN statement. This logic path is executed for the input variable GN = 0 and for values of CN which are greater than or equal to the value of CT. Note that the real variable TR and all of the integer variables are not involved in any computation for this logic path. Therefore the set $G_1$ of input data values which will cause this logic path to execute is completely defined by:

G1:  GN = 0
     CN ≥ CT

A similar analysis shows that test case 2 executes a different logic path, which is defined by the set $G_2$ of input data values:

G2:  GN ≠ 0
     CN ≥ TR

The variable CT and the integer variables are not involved in any computations.

Test case 3 executes a third logic path, which is defined by the set $G_3$ of input data values:

G3:  GN = 0
     CN < CT
     JA = I
     JB = I
     JD = I
     JE = I
     JV = I

where I denotes any integer value. The variable TR is not involved in any calculations.

Test case 4 executes a fourth logic path, which is defined by the set $G_4$ of input data values:

G4:  GN ≠ 0
     CN < TR
     JA = I
     JB = I
     JD = I
     JE = I
     JV = I

The variable CT is not involved in any computations.

There are no other logic paths. The set E can be defined as the union of the four sets $G_1$, $G_2$, $G_3$, $G_4$:

$$E = G_1 \cup G_2 \cup G_3 \cup G_4$$

Note that this is not equal to the cartesian product of the sets over which the input variables range, for the possible values of input variables are constrained by the relations used in defining the subsets $G_j$, including the nonutilization of some of the variables in each $G_j$.

Each test case is from a different $G_j$ and causes execution of a different logic path. Collectively, they exercise all the logic paths. No logic path was executed by more than one test case.
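The assignment of the four test cases to the four subsets can be reproduced mechanically. The sketch below transcribes only the two real-valued branch decisions of routine A into Python; the function name `logic_path` and the dictionary layout are assumptions made for illustration, and the integer variables are deliberately left out because they do not affect which path is taken.

```python
def logic_path(gn, cn, ct, tr):
    """Return which subset G1..G4 the real-valued inputs fall in."""
    if gn != 0.0:                            # first IF:  GN.NE.0.
        return "G4" if cn < tr else "G2"     # third IF:  CN.LT.TR
    return "G3" if cn < ct else "G1"         # second IF: CN.LT.CT


# (GN, CN, CT, TR) for the four test cases listed in the text.
test_cases = {
    1: (0.0, 5.0, 4.0, 6.0),
    2: (1.0, 8.0, 4.0, 6.0),
    3: (0.0, 3.0, 4.0, 6.0),
    4: (1.0, 5.0, 4.0, 6.0),
}
paths = {case: logic_path(*values) for case, values in test_cases.items()}
```

Each test case lands in a different subset, so the four cases together exercise every logic path exactly once.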

A logic diagram can be constructed for routine A in terms of the branch tests $B_i$:

B1 : GN.NE.0.
B2 : CN.LT.CT
B3 : CN.LT.TR
B4 : IE.NE.1

and the code segments $S_i$:

S1 : IE = 1
S2 : IE = 0
S3 : IE = 1
S4 : IE = 0
S5 : JE = JE + 1
     KI = JD
     KM = 2
     KR = 3
     KB = JA
     KE = JB
     JV = JV + KI + 1
     KG = 1
S6 : RETURN

From this logic diagram, it would appear that there are 8 logic paths; however, 4 of these apparent paths are not executable. For each of the four paths leading into $B_4$, the outcome of the branch test is predetermined; i.e., for all inputs that can cause execution of a specific path leading to $B_4$, the test in $B_4$ will result in the same truth value, allowing only one of the branches to be taken.
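The count of 8 apparent paths against 4 executable ones can be checked by enumeration. This is a minimal sketch under the assumption that each path is described by three truth values: the outcome of B1, the outcome of whichever of B2 or B3 is reached, and the outcome of B4. The function names are hypothetical.

```python
from itertools import product


def apparent_paths():
    """All structural paths: B1, then B2 or B3, then B4 (2 * 2 * 2 = 8)."""
    return list(product([False, True], repeat=3))


def is_executable(b1, b23, b4):
    """A path is executable only if B4 agrees with the IE value set upstream.

    When the second test (B2 or B3) is true, IE is set to 0, so IE.NE.1 is
    true; when it is false, IE is set to 1, so IE.NE.1 is false.
    """
    ie = 0 if b23 else 1
    return b4 == (ie != 1)


all_paths = apparent_paths()
feasible = [p for p in all_paths if is_executable(*p)]
```

The outcome of B1 never restricts feasibility; it is the coupling between the IE assignment and the final test that removes half of the apparent paths.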

The partitioning of the input data set E into subsets $G_j$ that correspond to logic paths was done for four Project 5 routines. For all of them the test cases belonged to different $G_j$. This was not surprising since TRW's software test tool PACE was used to aid in developing the test cases. PACE analyzes the routine structure and provides information to aid in developing a set of test cases which will collectively exercise all segments and branches in the routine. For some of the routines, the number of test cases needed to exercise all segments and branches is less than the number of $G_j$'s, because in most routines, in particular the larger ones, all segments and branches can be exercised by executing a subset of the set of all logic paths.

Additional results of the application of MTSR to Project 5 routines are reported in the following sections.


6.2 Extension of MTSR

The capabilities of MTSR were extended by investigating several software reliability problem areas, applying MTSR concepts to them, and developing new mathematical tools and techniques as needed to perform the indicated analysis.

6.2.1 Input Data Set Partitioning

For more detailed analysis of software reliability, it is convenient to partition the set E into disjoint subsets $S_j$:

$$E = S_1 \cup S_2 \cup \dots \cup S_k = \bigcup_j S_j$$

$$S_i \cap S_j = \emptyset \quad \text{for } i \neq j$$

where $\emptyset$ denotes the null set. The probability $P_j$ that an input $E_i$ is chosen from $S_j$ is:

$$P_j = \sum_{E_i \in S_j} p_i, \qquad \sum_{j=1}^{k} P_j = 1$$

Each subset $S_j$ can be further partitioned into two subsets $S'_j$ and $S''_j$ such that any input from $S'_j$ results in correct execution and any input from $S''_j$ results in an execution failure:

$$S'_j \cup S''_j = S_j$$

$$S'_j \cap S''_j = \emptyset$$

$$S''_j = \{ E_i \in S_j : y_i = 1 \}$$

$$S'_j = \{ E_i \in S_j : y_i = 0 \}$$

where $y_i = 0\ (1)$ if input $E_i$ results in correct execution (execution failure). The probability $P'_j$ that an input is a member of $S'_j$ is:

$$P'_j = \sum_{E_i \in S_j} (1 - y_i)\, p_i$$

and the probability $P''_j$ that an input is a member of $S''_j$ is:

$$P''_j = \sum_{E_i \in S_j} y_i\, p_i$$

so that

$$P_j = P'_j + P''_j$$

The reliability formula can be expressed in terms of $P'_j$ or $P''_j$:

$$R = 1 - \sum_{i=1}^{N} y_i p_i = 1 - \sum_{j=1}^{k} \sum_{E_i \in S_j} y_i p_i = 1 - \sum_{j=1}^{k} P''_j = \sum_{j=1}^{k} P'_j$$
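The bookkeeping behind the partitioned form of the reliability formula can be illustrated with a small table of inputs. The sketch below is illustrative only: the six inputs, their probabilities, and the function name `partition_probabilities` are assumptions, not data from the study.

```python
from collections import defaultdict


def partition_probabilities(inputs):
    """inputs: iterable of (subset_label, p_i, y_i) triples.

    Returns {label: (P_j, P'_j, P''_j)}, i.e. the total, correct-execution
    and execution-failure probabilities of each subset.
    """
    totals = defaultdict(lambda: [0.0, 0.0, 0.0])
    for label, p, y in inputs:
        totals[label][0] += p
        totals[label][1] += (1 - y) * p
        totals[label][2] += y * p
    return {label: tuple(v) for label, v in totals.items()}


# Hypothetical six-input program split into two subsets.
inputs = [
    ("S1", 0.30, 0), ("S1", 0.20, 0), ("S1", 0.10, 1),
    ("S2", 0.25, 0), ("S2", 0.10, 0), ("S2", 0.05, 1),
]
parts = partition_probabilities(inputs)
R_from_ok = sum(ok for _, ok, _ in parts.values())
R_from_fail = 1.0 - sum(fail for _, _, fail in parts.values())
```

Both routes give the same reliability, as the identity requires, because every input belongs to exactly one subset and is either correct or failing.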

One type of partitioning of E is that into the subsets $G_j$ associated with the logic paths $L_j$, as was done in the analysis of routine A in Section 6.1. $P_j$ is then the probability that logic path $L_j$ will be executed and $P'_j / P_j$ is the probability that, having selected an input from $G_j$, $L_j$ will be executed correctly.

Since the code sequence of a logic path $L_j$ is itself a program, $L_j$ may, in accordance with the mathematical definition of a program, be interpreted as a specification of a function $F_j$ on $G_j$. For all inputs $E_i \in G_j$, the same code sequence $L_j$ is executed and there is no branching outside of $L_j$; therefore $F_j$ may be considered to have a degree of continuity over $G_j$. Correct execution of $L_j$ for a particular input, say $E_k$, from $G_j$ verifies that the program computes $F_j(E_k)$ correctly. Because of the continuity argument, one may infer from the correct execution of $L_j$ for input $E_k$ that $L_j$ will also, with high probability, execute correctly for most of the other $E_i$ which are in $G_j$. The following argument tends to reinforce the intuitive feeling that this probability is high. $L_j$, in fact, specifies a function, which may be denoted by $F'_j$, and which is intended to be $F_j$. Correct execution of $L_j$ with input $E_k$ means that:

$$|F'_j(E_k) - F_j(E_k)| \le \Delta_k$$

where $\Delta_k$ is the acceptable tolerance. The notation indicates that the tolerance $\Delta_k$ could be different for each input $E_k$, although in general it will have the same value for a subset of $G_j$. When the tolerance must be met uniformly for the function $F_j$, the notation $\Delta_j$ would be used.

Many of the types of errors that can be made in code $L_j$ will result in $L_j$ specifying a function $F'_j$, the values of which will deviate from the corresponding values of $F_j$ by more than the acceptable tolerance $\Delta_j$ for all the points in $G_j$. There are, however, known types of errors for which $F'_j$ can equal $F_j$ at one or more points and deviate from it at all other points; e.g.:

• The assignment statement Y = X*2 + A at the point X = 2 assigns the same value to Y as the statement Y = X**2 + A, but assigns substantially different values at all other points.

• In the assignment statement Y = (Q(X) - Q0)P(X) + R(X), at the point X such that Q(X) = Q0, any error in P(X) will not affect the value assigned to Y, but errors in P(X) may significantly affect the values assigned to Y for other values of X.
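The first of these coincidences is easy to demonstrate. The sketch below evaluates the intended and the mistyped statement over a few integer points; the function names, the constant A = 1.0, and the range of X are assumptions chosen for illustration.

```python
def intended(x, a):
    """Intended statement: Y = X**2 + A."""
    return x ** 2 + a


def erroneous(x, a):
    """Mistyped statement: Y = X*2 + A."""
    return x * 2 + a


A = 1.0  # arbitrary constant chosen for the illustration
agree = [x for x in range(-3, 6) if intended(x, A) == erroneous(x, A)]
```

Over this range the two statements agree at X = 2, as noted above, and also at X = 0, since both X*2 and X**2 vanish there. A test case that happens to use either value would execute "correctly" despite the error, which is why the choice of test input matters.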

In order for such errors to lead to $|F'_j(E_t) - F_j(E_t)| \le \Delta_t$ for execution of test case $E_t$, not only must such an error be present in $L_j$ but also the input $E_t$ chosen for the test case must be the same, or very nearly the same, point at which the evaluation coincidence occurs. Developing an expression for the probability of such an occurrence would involve developing a measure for the functions $F'_j$ specified by erroneous $L_j$ and the subset of this measure associated with those functions $F'_j$ for which $|F'_j(E_k) - F_j(E_k)| \le \Delta_k$ for some points $E_k$ in $G_j$.

These considerations can be investigated further by defining the probability of acceptable execution:

$$\Pr\left(|F'_j(E_t) - F_j(E_t)| < \Delta_t\right)$$

with the function $F'_j$ considered to be a random variable owing to the random occurrence of errors which cause the specified $F'_j$ to differ from the intended $F_j$. Then, if the effects of errors do not have a bias, i.e., the expectation $\mathcal{E}(F'_j(E_t)) = F_j(E_t)$, the generalized Tchebycheff Inequality may be applied, which states that if g(x) is a non-negative function of the random variable x, then for any k > 0,

$$\Pr(g(x) < k) \ge 1 - \frac{\mathcal{E}(g(x))}{k}$$

Since the expected value of the square of the difference between a random variable and its expected value is by definition the variance of the random variable:

$$\mathcal{E}\left[\left(F'_j(E_t) - F_j(E_t)\right)^2\right] = \mathrm{Var}\left(F'_j(E_t)\right)$$

then replacing k by $\Delta_t^2$,

$$\Pr\left(|F'_j(E_t) - F_j(E_t)| < \Delta_t\right) \ge 1 - \frac{\mathrm{Var}\left(F'_j(E_t)\right)}{\Delta_t^2}$$

If:

$$\mathrm{Var}\left(F'_j(E_t)\right) < \Delta_t^2 (1 - \gamma),$$

then:

$$\Pr\left(|F'_j(E_t) - F_j(E_t)| < \Delta_t\right) > \gamma \qquad (*)$$

so that if $\gamma$ is sufficiently close to 1, say 0.9, the probability of correct execution is acceptably high.


The previous results can be further refined by accounting for the probability distribution of the Input Data point, $E_t$. The following is true, approximately:

$$\mathrm{Var}\left(F'_j(E_t)\right) \approx \left[\frac{\partial F'_j(E_t)}{\partial E_t}\right]^2 \mathrm{Var}(E_t) + V'_j$$

where $V'_j$ = variance of $F'_j(\mathcal{E}(E_t))$, and $\partial F'_j(E_t)/\partial E_t$ is evaluated at $E_t = \mathcal{E}(E_t)$, which measures the relative rate of change of the specified function $F'_j$. Thus

$$\left[\frac{\partial F'_j(E_t)}{\partial E_t}\right]^2 \mathrm{Var}(E_t) + V'_j < \Delta_t^2 (1 - \gamma)$$

implies (*).

When N executions of the program are made using a random sample of $E_t$ selected from the Input Data Subspace $G_j$, it is easily shown that:

$$\left[\frac{\partial F'_j(E_t)}{\partial E_t}\right]^2 \mathrm{Var}(E_t) + V'_j < \Delta_t^2 (1 - \gamma) N$$

implies (*).

Owing to the factor N, this condition becomes easier to meet as N gets larger. Thus, for all other quantities being fixed, as N increases the probability level $\gamma$ for any Input Data point $E_t$ resulting in a correct functional output within a given tolerance also increases.

In general, since the Input Data point is a member of a multidimensional space, $E_t = (X_1, \dots, X_n)$, the previous expressions need to be replaced by their n-dimensional analogues. The previous inequality generalizes to

$$\sum_{k=1}^{n} \left[\frac{\partial F'_j}{\partial X_k}\right]^2 \mathrm{Var}(X_k) + V'_j < \Delta_t^2 (1 - \gamma) N$$

which implies (*).

For simplicity, it is assumed in the previous expression that the $X_k$ are independently selected, so that the covariances of the $X_k$ are not included.
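The Tchebycheff bound used in this argument is simple to evaluate numerically. The following sketch assumes the bound has the form 1 - Var/tolerance² and that the variance condition scales with the number of executions N as described above; the function names and the tolerance value are hypothetical.

```python
def chebyshev_lower_bound(variance, tolerance):
    """Lower bound 1 - Var / tol**2 on Pr(|F' - F| < tol), floored at 0."""
    return max(0.0, 1.0 - variance / tolerance ** 2)


def max_variance_for(gamma, tolerance, n_runs=1):
    """Largest variance allowed by Var < tol**2 * (1 - gamma) * N."""
    return tolerance ** 2 * (1.0 - gamma) * n_runs


tol = 0.01                            # assumed acceptable tolerance
limit = max_variance_for(0.9, tol)    # variance ceiling for gamma = 0.9
bound = chebyshev_lower_bound(limit, tol)
```

At the variance ceiling the bound is exactly the requested level, 0.9. Because the bound makes no distributional assumption, it is usually conservative: the actual probability of acceptable execution can be considerably higher.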

There are also types of errors which affect only one point and which can occur frequently enough to warrant special care to be taken to check for them; e.g.:

• Divide by zero. It is possible to automatically identify all statements in which this could occur.

• Wrong branch operator, such as using .LE. for .LT. in the second statement of routine A (Section 6.1).


6.2.2 Reliability Measurement Uncertainty

In Section 5.2.4.2, a method of measuring the reliability of a program was described which involved making runs with a sample of n inputs chosen at random from E in accordance with the probability distribution $p_i$. There are other methods of sampling which may be more appropriate for particular situations. Some of these methods were investigated, making use of the partitioning of E into subsets $S_j$, and developing expressions for the variance and confidence limits on the reliability measurements obtained using the sample.


6.2.2.1 Sampling Theory

Assume that a preassigned number $n_j$ of data points are sampled from $S_j$, for each j. The sampling is assumed to be simple random; i.e., selection of any point of $S_j$ is unaffected by selection of any other point*, and each has the same chance of being chosen. We merely observe whether a selected point belongs to $S'_j$ or $S''_j$, so that this type of sampling is also known as simple binomial; i.e., the probability assigned to a sequence of $n_j$ points sampled from $S_j$ is

$$\frac{(P''_j)^{f_j} \, (P'_j)^{n_j - f_j}}{P_j^{\,n_j}}$$

where $f_j$ is the number of points selected from $S''_j$ (those evoking software failure).

There are various other sampling methods, such as sampling until a specified number of points from $S''_j$ are selected, thus making $n_j$ rather than $f_j$ a random variable, but these will not be discussed further.

*There are so many points in $S_j$ that the effect of replacement or nonreplacement on the probabilities of selection can be neglected.


Estimation for R

In general not all $S_j$ may be sampled. If this is the case the estimator $\hat{R}$ defined below will be biased. Thus if T is the collection of indexes of subsets $S_j$ of the partition which are sampled ($\bar{T}$ will denote the remaining j's),

$$\hat{R} = 1 - \sum_{j \in T} P_j \frac{f_j}{n_j} \qquad (1)$$

and the expected value of $\hat{R}$, denoted by $\mathcal{E}(\hat{R})$, is

$$\mathcal{E}(\hat{R}) = 1 - \sum_{j \in T} P_j \frac{\mathcal{E}(f_j)}{n_j} = 1 - \sum_{j \in T} P''_j \ge 1 - \sum_{j} P''_j = R \qquad (2)$$

where the latter summation is over all j = 1, 2, ..., K. Therefore, when sampling is incomplete, $\hat{R}$ will be biased on the high side. By apportioning just one data point sample to each of the $S_j$ in $\bar{T}$ (but keeping the total sample size $\sum_j n_j = n$ constant) the bias in the estimator $\hat{R}$ would be removed. We now need a measure of the precision of $\hat{R}$.

One such measure is the variance, abbreviated as V( ). Thus, since $V(f_j) = n_j P'_j P''_j / P_j^2$, it can be shown that

$$V(\hat{R}) = \sum_{j \in T} \frac{P'_j P''_j}{n_j} \qquad (3)$$

Since, however, $\hat{R}$ is biased for incomplete samples, a better measure is the mean-square error, defined by $\mathcal{E}[(\hat{R} - R)^2]$. We have

$$\mathcal{E}\left[(\hat{R} - R)^2\right] = \sum_{j \in T} \frac{P'_j P''_j}{n_j} + \left(\sum_{j \in \bar{T}} P''_j\right)^2 \qquad (4)$$

Thus the mean-square error can never be less than the number

$$\left(\sum_{j \in \bar{T}} P''_j\right)^2$$

even should the $n_j$ all become large. This makes $\mathcal{E}[(\hat{R} - R)^2]$ a measure of sampling completeness in a certain sense; however there is no way of calculating it, for although we have information on the sizes of $P'_j$ or $P''_j$ for $j \in T$, nothing is known about $P''_j$ for $j \in \bar{T}$, except the crude bounds $0 \le P''_j \le P_j$. Now it is easy to show that

$$\mathcal{E}\left[\sum_{j \in T} P_j^2 \, \frac{f_j (n_j - f_j)}{n_j^2 (n_j - 1)}\right] = V_T(\hat{R}) \qquad (5)$$

and therefore, using the crude bounds on $P''_j$ for $j \in \bar{T}$, an approximate set of numerical* lower and upper bounds on the mean-square error are

$$\left\langle V_T(\hat{R}) \right\rangle \le \mathcal{E}\left[(\hat{R} - R)^2\right] \le \left\langle V_T(\hat{R}) \right\rangle + \left(\sum_{j \in \bar{T}} P_j\right)^2 \qquad (6)$$

The existence of the above unbiased estimator for $V_T(\hat{R})$ evidently requires that every $n_j \ge 2$.
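The effect of incomplete sampling on the estimator can be made concrete with a small calculation. The sketch below assumes the estimator has the form R-hat = 1 - sum over sampled subsets of P_j f_j / n_j, with mean-square error equal to variance plus squared bias; the function names and all numerical values are illustrative assumptions.

```python
def estimate_reliability(P, f, n):
    """R-hat = 1 - sum of P_j * f_j / n_j over sampled subsets (n_j > 0)."""
    return 1.0 - sum(Pj * fj / nj for Pj, fj, nj in zip(P, f, n) if nj > 0)


def expected_estimate(P_fail, n):
    """E(R-hat) = 1 - sum of P''_j over the sampled subsets."""
    return 1.0 - sum(pf for pf, nj in zip(P_fail, n) if nj > 0)


def mean_square_error(P_ok, P_fail, n):
    """Variance over sampled subsets plus squared bias from unsampled ones."""
    var = sum(a * b / nj for a, b, nj in zip(P_ok, P_fail, n) if nj > 0)
    bias = sum(b for b, nj in zip(P_fail, n) if nj == 0)
    return var + bias ** 2


# Assumed values: three subsets, the third left unsampled.
P_ok, P_fail = [0.4, 0.2, 0.15], [0.1, 0.1, 0.05]
n = [60, 40, 0]
true_R = 1.0 - sum(P_fail)
bias = expected_estimate(P_fail, n) - true_R
mse = mean_square_error(P_ok, P_fail, n)
```

In this example the bias equals the failure probability of the unsampled subset (0.05), and the mean-square error cannot fall below its square however many runs are allotted to the first two subsets.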

*"⟨ ⟩" denotes a numerical estimate of the quantity within.

Minimization of Variance

Since the variance of

$$\hat{R} = 1 - \sum_{j} P_j \frac{f_j}{n_j}$$

for a complete sample is given by

$$V(\hat{R}) = \sum_{j} \frac{P'_j P''_j}{n_j}$$

the question arises whether the $n_j$ can be chosen in some manner to make $V(\hat{R})$ as small as possible. Thus for a given $n = \sum_j n_j$, if we write

$$W = \sum_{j} \frac{P'_j P''_j}{n_j} + \lambda^2 \left(\sum_{k} n_k - n\right)$$

then solve the set of equations $\partial W / \partial n_j = 0$ for all j together with the constraint

$$\sum_{k} n_k = n$$

we may obtain an extremal solution for the $n_j$ which can then be shown to yield a minimum $V(\hat{R})$. Thus

$$\frac{\partial W}{\partial n_j} = -\frac{P'_j P''_j}{n_j^2} + \lambda^2 = 0$$

Consequently

$$\lambda = \frac{1}{n} \sum_{k} \sqrt{P'_k P''_k}$$

Therefore

$$n_j \approx \frac{n \sqrt{P'_j P''_j}}{\sum_{k} \sqrt{P'_k P''_k}} \qquad (7)$$

(The "≈" is used since $n_j$ is an integer.)

This choice of $n_j$ will yield a minimum since $\partial^2 W / \partial n_j^2 > 0$ for all subsets for which $P'_j, P''_j > 0$. Thus, substituting the solutions for the $n_j$, the minimum value of the variance becomes

$$V_{\min}(\hat{R}) = \frac{1}{n} \left(\sum_{j} \sqrt{P'_j P''_j}\right)^2 \qquad (8)$$

Of course, unless the values of $P''_j$ are known before sampling takes place, the optimum choice of the $n_j$ cannot be deliberately made. If a preliminary sample were taken, however, and assuming no significant change in the value of $P''_j$ (e.g., as a result of correcting the detected faults), the final sample of $n_j$ from each subset could be chosen to yield a near-minimal variance estimate.
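The gain from the variance-minimizing allocation can be evaluated directly. The sketch below uses the three-subset values of the numerical example in this section (P' = 0.4, 0.2, 0.2 and P'' = 0.1, 0.1, 0 with n = 100) and compares proportional allocation with the square-root rule; the function names are assumptions, and the allocation is left unrounded for simplicity.

```python
import math


def variance(P_ok, P_fail, n):
    """V(R-hat) = sum_j P'_j P''_j / n_j; terms with P'_j P''_j = 0 add nothing."""
    return sum(a * b / nj for a, b, nj in zip(P_ok, P_fail, n) if a * b > 0)


def optimal_allocation(P_ok, P_fail, n_total):
    """n_j = n * sqrt(P'_j P''_j) / sum_k sqrt(P'_k P''_k) (not rounded)."""
    roots = [math.sqrt(a * b) for a, b in zip(P_ok, P_fail)]
    return [n_total * r / sum(roots) for r in roots]


def minimum_variance(P_ok, P_fail, n_total):
    """V_min = (sum_j sqrt(P'_j P''_j))**2 / n."""
    return sum(math.sqrt(a * b) for a, b in zip(P_ok, P_fail)) ** 2 / n_total


P_ok, P_fail = [0.4, 0.2, 0.2], [0.1, 0.1, 0.0]
v_proportional = variance(P_ok, P_fail, [50, 30, 20])
n_opt = optimal_allocation(P_ok, P_fail, 100)
v_optimal = variance(P_ok, P_fail, n_opt)
```

For these values the rule assigns no runs to the third subset, because its failure probability is taken as exactly zero. In practice that value is not known in advance, so a zero allocation would leave the subset unsampled and reintroduce the bias discussed earlier.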

A numerical example will illustrate the relationships discussed.

Let  $P_1 = 0.5$,  $P_2 = 0.3$,  $P_3 = 0.2$

Let  $P''_1 = 0.1$,  $P''_2 = 0.1$,  $P''_3 = 0$

(and therefore)  $P'_1 = 0.4$,  $P'_2 = 0.2$,  $P'_3 = 0.2$

Assume that n = 100, and that the $n_j$ are chosen proportional to $P_j$ so that $n_1 = 50$, $n_2 = 30$, $n_3 = 20$. Then

Assume sampling is incomplete, so that for $j \in \bar{T}$, $n_j = 0$, and

$$\sum_{j \in T} n_j = n, \quad \text{where } T \cup \bar{T} = (\text{all } j)$$

What choice of $n_j$ will minimize $\chi^2$?

To determine the $n_j$, we set up the function

$$W = \sum_{j} \frac{(n_j - n P_j)^2}{n P_j} - 2\lambda \left(\sum_{j \in T} n_j - n\right)$$

Set

$$\frac{\partial W}{\partial n_k} = \frac{2 (n_k - n P_k)}{n P_k} - 2\lambda = 0$$

Hence

$$n_k = (1 + \lambda)\, n P_k, \quad k \in T$$

Therefore

$$n_j = \frac{n P_j}{\sum_{k \in T} P_k} \qquad (10)$$

The minimized value of $\chi^2$ is then

$$\chi^2_{\min} = n \, \frac{\sum_{j \in \bar{T}} P_j}{\sum_{j \in T} P_j} \qquad (11)$$

Thus the minimum possible value of $\chi^2$ is proportional to the ratio of the sum of probabilities for those subsets not sampled to the sum of probabilities for those subsets that are sampled.
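The minimized value can be verified by direct substitution. The sketch below assumes that the statistic being minimized is the sum over all subsets of (n_j - nP_j)²/(nP_j), which is the form implied by the derivative set to zero in the derivation; the function names and the choice of which subset is left unsampled are illustrative.

```python
def chi_square(n_alloc, P, n_total):
    """Sum over all subsets of (n_j - n*P_j)**2 / (n*P_j)."""
    return sum((nj - n_total * Pj) ** 2 / (n_total * Pj)
               for nj, Pj in zip(n_alloc, P))


def best_allocation(P, sampled, n_total):
    """n_j = n*P_j / sum_{k in T} P_k for sampled subsets, 0 otherwise."""
    covered = sum(Pj for Pj, s in zip(P, sampled) if s)
    return [n_total * Pj / covered if s else 0.0 for Pj, s in zip(P, sampled)]


P = [0.5, 0.3, 0.2]
sampled = [True, True, False]     # assume the third subset is never sampled
alloc = best_allocation(P, sampled, 100)
chi2_min = chi_square(alloc, P, 100)
```

Here the allocation is 62.5 and 37.5 runs for the two sampled subsets, and the statistic equals n times 0.2/0.8, i.e. 25. Any other split of the 100 runs between the two sampled subsets gives a larger value.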
Confidence Limits for R


Several  measures  of  "confidence"  in  the  measure  of  software  reliability 
may  be  developed  from  the  sampling  theory  presented  in  the  previous  sec- 
tions. The  term  "confidence'*  applied  in  a technical  sense  to  software 
reliability  means  a measure  of  how  closely  the  reliability  is  estimated 
based  on  a series  of  test  outcomes  and  a prespecified  rule  for  estimating 
the  reliability  from  the  outcomes*  "Confidence"  has  also  been  used  to 
denote  degree  of  representativeness,  thoroughness,  etc.;  of  the  test  strat- 
egy or  sampling  technique  as  discussed  in  [25]. 

If $V$ denotes an estimator for the variance of $\hat{R}$, then based on asymptotic normality arguments,

\[ \hat{R} \pm x_{(1-\gamma)/2} \sqrt{V} \]

will be an approximate confidence interval covering $R$ with confidence $\gamma$, where $x_\alpha$ is the standard normal deviate exceeded with probability $\alpha$. The above confidence interval would be expected to be more accurate when the numbers of tests $n_j$ for each subset are large. For smaller numbers of tests, $x_{(1-\gamma)/2}$ should be replaced by the Student t-deviate $t_{n;(1-\gamma)/2}$ with $n$ degrees of freedom, exceeded with probability $(1-\gamma)/2$. The latter confidence interval is wider than the former, and as all $n_j$ get large it will approach the "normal" confidence interval. For example, $t_{30;0.05} = 1.697$ and $t_{120;0.05} = 1.658$, whereas $x_{0.05} = 1.645$.

We are generally interested in one-sided confidence intervals of the form $R_L < R < 1$, where $R_L$ is called the lower confidence limit on $R$. For $\gamma$ confidence, the one-sided intervals for the previous cases would then be

\[ \hat{R} - x_{1-\gamma} \sqrt{V} \quad \text{or} \quad \hat{R} - t_{n;1-\gamma} \sqrt{V} \tag{12} \]
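A minimal sketch of the normal-deviate form of the one-sided limit follows. The estimate 0.80 and the variance 0.0014667 are hypothetical inputs, and the function name is illustrative; only the Python standard library is used, so the Student t variant is not shown.

```python
import math
from statistics import NormalDist

def lower_limit_normal(r_hat, var_hat, gamma):
    """One-sided lower limit: r_hat minus the normal deviate times sqrt(V).

    inv_cdf(gamma) is the standard normal deviate exceeded with
    probability 1 - gamma.
    """
    x = NormalDist().inv_cdf(gamma)
    return r_hat - x * math.sqrt(var_hat)

# Hypothetical estimate and variance, 95 percent confidence.
limit = lower_limit_normal(0.80, 0.0014667, 0.95)
```

For small numbers of tests the same function could take a Student t deviate in place of `x`, which widens the interval as the text notes.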


In  any  of  the  above  cases,  If  sampling  were  incomplete  the  intervals 

cover 


R ■ S pj  "M<:h  s 2 ri 

T j 

Consequently the true value of $R$ would be covered by the confidence interval with probability $\gamma' \neq \gamma$.

The magnitude of the difference between $\gamma$ and $\gamma'$ is not easily estimated; however, a "safety margin" could be incorporated by using slightly higher values of $\gamma$ in determining $x_{1-\gamma}$ or $t_{n;1-\gamma}$. These differences are not likely to be important, however, in that the whole procedure already has many elements of approximation.

The  approach  to  finding  lower  confidence  limits  on  reliability  given 
in  the  next  section  avoids  the  asymptotic  normality  assumption,  and  can  be 
termed  an  "exact"  method.  Some  additional  background  on  previous  results 
in  this  area  is  provided. 

Background  on  Exact  Methods 

Since about 1958, many papers have been written on the problem of constructing statistical confidence limits on the reliability of a system based upon observed failure data on the components of the system [26]. A typical problem is: A system is serial with respect to failures of the two components which make up the system; i.e., if either component fails, then the system fails. If the failure of either component has no influence on success or failure of the other, i.e., the events failure$_1$, failure$_2$ are statistically independent, then the reliability of the system is $R = R_1 R_2$. Given $n_1, n_2$ tests of each component and $f_1, f_2$ failures observed, where $0 \le f_1 \le n_1$ and $0 \le f_2 \le n_2$, the Neyman definition [27] leads to construction of a system of lower confidence limits $R_L(f_1, f_2; n_1, n_2, \gamma)$, i.e., such that Prob$(R_L \le R) \ge \gamma$, where $\gamma$ is the specified confidence coefficient (e.g., 0.50, 0.90, 0.95) and $R$ is the actual (but unknown) reliability of the system.

This particular problem was addressed originally in [28] and [29] and subsequently in [30].
For the software reliability model, since the actual reliability with respect to the jth subset of a partition is $P''_j / P_j = R_j$, the problem becomes one of determining a system of lower confidence limits on the quantity

\[ R = \sum_j P_j R_j \]

for all possible outcomes $f_j$ and given sample sizes $n_j$, $j = 1, 2, \ldots, K$.

Neyman  Confidence  Limits 


Since the general formulation of the Neyman or classical confidence limits problem is cumbersome, the formulas will be given only for the case in which there are two subsets in the partition.

In this case

\[ R = R_1 P_1 + R_2 P_2 \tag{14} \]

where

\[ R_1 = \frac{P''_1}{P_1}, \qquad R_2 = \frac{P''_2}{P_2}. \]

Based upon [28], a lower confidence limit on $R$ for observed numbers of failures $f_1, f_2$ out of $n_1, n_2$ tests respectively is obtained by minimizing (with respect to $X_1, X_2$) the function $X_1 P_1 + X_2 P_2$ with the following equality constraint:

\[ \sum_{i_1, i_2 = 0}^{f_1, f_2} \binom{n_1}{i_1} X_1^{n_1 - i_1} (1 - X_1)^{i_1} \binom{n_2}{i_2} X_2^{n_2 - i_2} (1 - X_2)^{i_2} = 1 - \gamma \tag{15} \]

and also by the inequalities

\[ 0 \le X_1, X_2 \le 1 \]

where $\gamma$ is the confidence coefficient.

The resulting lower confidence limit is denoted by

\[ R_L(f_1, f_2; n_1, n_2, \gamma) \]

The quantity summed in (15) is the product of the probabilities of observing exactly $i_1$ failures out of $n_1$ tests and $i_2$ failures out of $n_2$ tests, where sampling is from the respective subsets $S_1, S_2$ and the observations are independent (both within each subset and between subsets). Not apparent in the summation is a prescribed ordering of the sample outcomes $(f_1, f_2; n_1, n_2)$. The ordering is not unique, as shown in [23], but there are various ways of choosing an ordering which is "more or less" optimum. Here, the meaning of "optimum" is that for every $f_1, f_2$, each of the confidence limits $R_L(f_1, f_2; n_1, n_2, \gamma)$ for the prescribed ordering is at least equal to the corresponding $R'_L(f_1, f_2; n_1, n_2, \gamma)$ obtained from some other ordering; i.e., all other things being equal, the larger confidence limits are better.

One way of ordering the sample points is: for any observed $(f_1, f_2; n_1, n_2)$, to include in the summation the probabilities of all points $(i_1, i_2; n_1, n_2)$ such that $0 \le i_1 \le f_1$, $0 \le i_2 \le f_2$. This ordering will be satisfactory except when $n_1$ and $n_2$ are very different. A better ordering is obtained by including all sample points $(i_1, i_2; n_1, n_2)$ for which the estimated reliabilities have the relationship

\[ \frac{n_1 - i_1}{n_1} P_1 + \frac{n_2 - i_2}{n_2} P_2 \ge \frac{n_1 - f_1}{n_1} P_1 + \frac{n_2 - f_2}{n_2} P_2 \]

which should result in values of $R_L$ which are ordered in the same way as the estimated reliabilities, although this assertion could not be verified. In any suitable ordering, however, the sample outcome $(0, 0; n_1, n_2)$ will be the first point, corresponding to the largest $R_L$ for the set of possible outcomes.

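For this case (15) becomes simply

\[ X_1^{n_1} X_2^{n_2} = 1 - \gamma \tag{15'} \]

It is easy to show* that the minimum value of $X_1 P_1 + X_2 P_2$ subject to the constraints (15') and $0 \le X_1, X_2 \le 1$ is

\[ R_L = n (1 - \gamma)^{1/n} \Bigl( \frac{P_1}{n_1} \Bigr)^{n_1/n} \Bigl( \frac{P_2}{n_2} \Bigr)^{n_2/n} \]

where $n = n_1 + n_2$ and $P_1 + P_2 = 1$. The generalization to $K$ subsets in the partition, in which case the problem is to minimize

\[ \sum_{j=1}^{K} X_j P_j \]

subject to

\[ \prod_{j=1}^{K} X_j^{n_j} = 1 - \gamma \]

and the requirement that each $X_j$ satisfy $0 \le X_j \le 1$, is also easy, and we obtain

\[ R_L = n (1 - \gamma)^{1/n} \prod_{j=1}^{K} \Bigl( \frac{P_j}{n_j} \Bigr)^{n_j/n} \]

* Using the method of Lagrange multipliers.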


Examples: (Both examples are based upon the hypothetical data of Reference [25].)

1-a) n_1 = 2, n_2 = 1, n_3 = 1, n_4 = 1. Hence n = 5.
     P_1 = 0.5, P_2 = 0.33333, P_3 = 0.1, P_4 = 0.06667.
     Therefore

\[ R_L = 5 (1-\gamma)^{0.2} (0.25)^{0.4} (0.33333)^{0.2} (0.1)^{0.2} (0.06667)^{0.2} = 0.8462 \, (1-\gamma)^{0.2} \]

       γ   = 0.50     0.90     0.95
       R_L = 0.7367   0.5340   0.4649

1-b) n_1 = 6, n_2 = 3, n_3 = 1, n_4 = 1. Hence n = 11.
     The P_j are the same as in Example 1-a).
     Therefore

\[ R_L = 11 (1-\gamma)^{0.090909} (0.08333)^{0.54545} (0.11111)^{0.27273} (0.1)^{0.090909} (0.066667)^{0.090909} = 0.98781 \, (1-\gamma)^{0.090909} \]

       γ   = 0.50     0.90     0.95
       R_L = 0.9275   0.8012   0.7523
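The two examples can be reproduced with a short sketch of the zero-failure formula, taken here as $R_L = n (1-\gamma)^{1/n} \prod_j (P_j / n_j)^{n_j/n}$, which is the form the numerical factors in the examples follow. The function name is illustrative, and the probabilities 0.33333 and 0.06667 are entered as 1/3 and 1/15 (an assumption about the intended values).

```python
def neyman_lower_limit_zero_failures(p, n_tests, gamma):
    """Lower confidence limit on reliability when no failures are observed.

    p       -- subset probabilities P_j (summing to 1)
    n_tests -- number of tests n_j run in each subset (all positive)
    gamma   -- confidence coefficient
    """
    n = sum(n_tests)
    limit = n * (1.0 - gamma) ** (1.0 / n)
    for pj, nj in zip(p, n_tests):
        limit *= (pj / nj) ** (nj / n)
    return limit

p = [0.5, 1 / 3, 0.1, 1 / 15]
example_1a = [neyman_lower_limit_zero_failures(p, [2, 1, 1, 1], g)
              for g in (0.50, 0.90, 0.95)]
example_1b = [neyman_lower_limit_zero_failures(p, [6, 3, 1, 1], g)
              for g in (0.50, 0.90, 0.95)]
```

When the $n_j$ are proportional to the $P_j$, the product collapses and the function returns $(1-\gamma)^{1/n}$.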




When sampling is proportional to the probability of the subset, i.e., $n_j = nP_j$, then

\[ R_L = n (1 - \gamma)^{1/n} \cdot \frac{1}{n} = (1 - \gamma)^{1/n} \tag{20} \]

which is the usual lower binomial confidence limit on reliability obtained when $n$ tests are made with zero failures. This is of course equivalent to the case when there is only one subset.

The formulation of a method for construction of Neyman confidence limits in general is apparently very difficult. However, the Neyman confidence limits do not require the assumption of an a priori probability distribution, as does the Bayesian method to be presented below, so they are in a sense to be preferred over the system of confidence limits generated by the latter approach. Nevertheless, the relative ease of determining Bayesian confidence limits may be sufficient to dictate abandonment of the Neyman confidence limits. As indicated later in the Conclusions, however, a compromise is recommended.

The following sections present one Bayesian method for constructing lower confidence limits on software reliability and show how numerical solutions may be obtained in general.

Bayesian  Confidence  Limits 

The a posteriori probability density of the reliability $R_j$ for the $j$th subset, given $f_j$ failures in $n_j$ tests and an assumed uniform a priori probability density of $R_j$, is given by the well-known formula

\[ \text{Prob}(r_j < R_j < r_j + dr_j) = \frac{\Gamma(n_j + 2)}{\Gamma(n_j - f_j + 1) \, \Gamma(f_j + 1)} \, r_j^{\,n_j - f_j} (1 - r_j)^{f_j} \, dr_j \]

which is a Beta density function.

Once the distribution of

\[ R = \sum_{j=1}^{K} P_j R_j \]

is found, the lower $\gamma$ confidence limit is defined as the value of $r$ for which Prob$(R \ge r) = \gamma$. Thus the problem is to find the distribution of a weighted sum of independent Beta distributed random variables.

This problem is formidable analytically.* However, an approximate solution is to assume that the random variable $R$ itself has a Beta distribution and determine its parameters by matching the moments of $R$ with those of $\sum_j P_j R_j$, which are readily calculated.
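As an independent check on any approximation, the distribution of the weighted sum can also be sampled directly. The following is a minimal Monte Carlo sketch, assuming each $R_j$ has a Beta posterior with parameters $n_j - f_j + 1$ and $f_j + 1$ (the uniform-prior case described above); the function name, number of draws, and seed are illustrative choices, and the data are those of Example 1-b.

```python
import random

def bayes_lower_limits_mc(p, n_tests, failures, gammas, draws=50_000, seed=1):
    """Estimate r with Prob(R >= r) = gamma by sampling R = sum_j P_j R_j."""
    rng = random.Random(seed)
    samples = []
    for _ in range(draws):
        r = 0.0
        for pj, nj, fj in zip(p, n_tests, failures):
            # Posterior of R_j under a uniform prior: Beta(n_j - f_j + 1, f_j + 1)
            r += pj * rng.betavariate(nj - fj + 1, fj + 1)
        samples.append(r)
    samples.sort()
    # The lower gamma limit is the (1 - gamma) quantile of the sampled R.
    return {g: samples[int((1.0 - g) * draws)] for g in gammas}

limits = bayes_lower_limits_mc(
    [0.5, 1 / 3, 0.1, 1 / 15], [6, 3, 1, 1], [0, 0, 0, 0], [0.50, 0.90, 0.95]
)
```

The sampling error shrinks roughly as the inverse square root of the number of draws, so more draws are needed for high confidence coefficients than for the median.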

The mean of $R_j$, denoted by $\mathcal{E}(R_j)$, is easily found to be

\[ \mathcal{E}(R_j) = \frac{n_j - f_j + 1}{n_j + 2} \tag{21} \]

The variance of $R_j$, denoted by $V(R_j)$, is

\[ V(R_j) = \frac{(n_j - f_j + 1)(f_j + 1)}{(n_j + 2)^2 (n_j + 3)} \tag{22} \]

We have

\[ \mathcal{E}(R) = \sum_{j=1}^{K} P_j \, \mathcal{E}(R_j) \tag{23} \]

and since the $R_j$'s are independent

\[ V(R) = \sum_{j=1}^{K} P_j^2 \, V(R_j) \tag{24} \]

* A Monte Carlo evaluation may be suitable, particularly for large values of $K$. See Reference [34].

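Thus if the density of $R$ is given by the Beta distribution with parameters $p, q$

\[ \text{Prob}(r < R < r + dr) = \frac{\Gamma(p + q)}{\Gamma(p) \, \Gamma(q)} \, r^{p-1} (1 - r)^{q-1} \, dr \tag{25} \]

then we equate

\[ \frac{p}{p + q} = \mathcal{E}(R) = \sum_{j=1}^{K} P_j \, \frac{n_j - f_j + 1}{n_j + 2} \tag{26} \]

\[ \frac{pq}{(p + q)^2 (p + q + 1)} = V(R) = \sum_{j=1}^{K} P_j^2 \, \frac{(n_j - f_j + 1)(f_j + 1)}{(n_j + 2)^2 (n_j + 3)} \tag{27} \]

Since

\[ p + q = \frac{p}{\mathcal{E}(R)} \tag{28} \]

and

\[ \frac{q}{p + q} = 1 - \mathcal{E}(R) \tag{29} \]

Equation (27) yields

\[ p + q = \frac{\mathcal{E}(R) [1 - \mathcal{E}(R)]}{V(R)} - 1 \tag{30} \]

and Equation (28) results in

\[ p = \mathcal{E}(R) \Bigl\{ \frac{\mathcal{E}(R) [1 - \mathcal{E}(R)]}{V(R)} - 1 \Bigr\} \]

Tables of the Beta distribution [31] or a computer program can then be used to determine the value of $r$ such that Prob$(R \ge r) = \gamma$.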
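The moment-matching step can be sketched as follows. The sketch assumes the Beta mean and variance of each $R_j$ under a uniform prior, namely $(n_j - f_j + 1)/(n_j + 2)$ and $(n_j - f_j + 1)(f_j + 1)/[(n_j + 2)^2 (n_j + 3)]$, and solves the two matching conditions for $p$ and $q$; the function name is illustrative, and the data are those of Examples 1-a) and 1-b) with zero failures.

```python
def beta_fit(p, n_tests, failures):
    """Return (mean, variance, p, q) of the Beta distribution fitted to R."""
    mean = sum(pj * (nj - fj + 1) / (nj + 2)
               for pj, nj, fj in zip(p, n_tests, failures))
    var = sum(pj ** 2 * (nj - fj + 1) * (fj + 1) / ((nj + 2) ** 2 * (nj + 3))
              for pj, nj, fj in zip(p, n_tests, failures))
    total = mean * (1.0 - mean) / var - 1.0   # this is p + q
    return mean, var, mean * total, (1.0 - mean) * total

probs = [0.5, 1 / 3, 0.1, 1 / 15]
fit_a = beta_fit(probs, [2, 1, 1, 1], [0, 0, 0, 0])
fit_b = beta_fit(probs, [6, 3, 1, 1], [0, 0, 0, 0])
```

Given `p` and `q`, the limit itself is the $(1 - \gamma)$ quantile of the fitted Beta distribution; if SciPy is available, `scipy.stats.beta.ppf(1 - gamma, p, q)` is one way to evaluate it.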

Examples:

2-a) We use the same data as in the previous Example 1-a).

                                               j = 1       j = 2       j = 3       j = 4
     (n_j - f_j + 1)/(n_j + 2)              =  0.75000     0.66667     0.66667     0.66667
     x P_j                                  =  0.37500  +  0.22222  +  0.06667  +  0.04444
     Therefore E(R) = 0.70833

     (n_j - f_j + 1)(f_j + 1)
     ------------------------               =  0.03750     0.05556     0.05556     0.05556
     (n_j + 2)^2 (n_j + 3)
     x P_j^2                                =  0.009375 +  0.006173 +  0.000556 +  0.000247
     Therefore V(R) = 0.016350

     p = 8.2421
     q = 3.3939

     The values of r corresponding to Prob (R >= r) = γ, where R has a Beta distribution with parameters p = 8.2421, q = 3.3939, are given below:

       γ   = 0.50     0.90     0.95
       R_L = 0.7206   0.5334   0.4780

2-b) We use the same data as in the previous Example 1-b).

                                               j = 1        j = 2        j = 3        j = 4
     (n_j - f_j + 1)/(n_j + 2)              =  0.87500      0.80000      0.66667      0.66667
     x P_j                                  =  0.43750   +  0.26667   +  0.06667   +  0.04444
     Therefore E(R) = 0.81528

     (n_j - f_j + 1)(f_j + 1)
     ------------------------               =  0.012153     0.026667     0.055556     0.055556
     (n_j + 2)^2 (n_j + 3)
     x P_j^2                                =  0.0030382 +  0.0029630 +  0.0005556 +  0.0002469
     Therefore V(R) = 0.0068037

     p = 17.2310
     q = 3.9042

     The values of r corresponding to Prob (R >= r) = γ, where R has a Beta distribution with parameters p = 17.2310, q = 3.9042, are given below:

       γ   = 0.50     0.90     0.95
       R_L = 0.8253   0.7032   0.6640

The results of the Neyman and Bayesian methods compare closely in Examples 1-a) and 2-a), but in Example 2-b) the Bayesian lower confidence limits are significantly smaller, i.e., more conservative, than the Neyman limits of Example 1-b).

The preceding analysis considered that the distribution of $\sum_j P_j R_j$ could be approximated by a Beta distribution using the first two moments of the distribution of each $R_j$ and the known formulas for the first two moments of a sum of random variables. Reference [32] presents a technique which provides correction terms to a fitted Beta distribution, at the expense, however, of calculating more moments of each $R_j$ and a considerably more difficult computation.

Mr. J. E. Wolf of TRW had previously developed the technique of [32] for application to a slightly different problem [33]. A brief description of the application to the present problem is given in the following paragraphs.

The basic idea is to fit a curve of the form

\[ g(x) = a_0 x^{\alpha} (1 - x)^{\beta} (1 + a_1 x + \cdots + a_k x^k) \tag{32} \]

to the probability density function for $\sum_j P_j R_j$. When just the constant term $a_0$ is present, we have the previous case of fitting the density function to a Beta density, with $a_0 = \Gamma(\alpha + \beta + 2) / [\Gamma(\alpha + 1) \, \Gamma(\beta + 1)]$ (Equation (25) with $p = \alpha + 1$, $q = \beta + 1$).* In general, when all terms through $a_k x^k$ are to be used, the first $k + 2$ moments of each $R_j$ are required.

* Note that the requirement for the area under the curve between 0 and 1 to be unity essentially determines one of the parameters.

The $i$th ordinary moment, or moment about the origin, for the distribution of $R_j$ is given by

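\[ \mathcal{E}(R_j^i) = \frac{\Gamma(n_j + 2) \, \Gamma(n_j - f_j + 1 + i)}{\Gamma(n_j - f_j + 1) \, \Gamma(n_j + 2 + i)} \tag{33} \]

The next step is to determine the ordinary moments $M_i$ of the random variable

\[ R = \sum_{j=1}^{K} P_j R_j \]

The first moment $M_1 = \mathcal{E}(R)$ was already given by (23). The second central moment was given by (24). The second ordinary moment is given by

\[ M_2 = \sum_j P_j^2 \, \mathcal{E}(R_j^2) + 2 \sum_{j<k} P_j P_k \, \mathcal{E}(R_j) \, \mathcal{E}(R_k) \tag{34} \]

The third ordinary moment is given by

\[ M_3 = 3 M_1 M_2 - 2 M_1^3 + \sum_j P_j^3 \, \mathcal{E}(R_j^3) - 3 \sum_j P_j^3 \, \mathcal{E}(R_j) \, \mathcal{E}(R_j^2) + 2 \sum_j P_j^3 \, [\mathcal{E}(R_j)]^3 \tag{35} \]

The fourth ordinary moment is less straightforward and is too complicated to be given here.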
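The first three ordinary moments can be computed with a short sketch. It assumes the standard raw moments of a Beta distribution with parameters $n_j - f_j + 1$ and $f_j + 1$ for each $R_j$, and uses independence of the $R_j$ to combine them (third central moments of independent terms add); the function names are illustrative.

```python
from itertools import combinations

def beta_raw_moment(n, f, i):
    """i-th moment about the origin of Beta(n - f + 1, f + 1)."""
    a, b = n - f + 1, f + 1
    m = 1.0
    for r in range(i):
        m *= (a + r) / (a + b + r)
    return m

def sum_moments(p, n_tests, failures):
    """First three ordinary moments of R = sum_j P_j R_j (independent R_j)."""
    m1 = [beta_raw_moment(n, f, 1) for n, f in zip(n_tests, failures)]
    m2 = [beta_raw_moment(n, f, 2) for n, f in zip(n_tests, failures)]
    m3 = [beta_raw_moment(n, f, 3) for n, f in zip(n_tests, failures)]
    big_m1 = sum(pj * a for pj, a in zip(p, m1))
    big_m2 = sum(pj ** 2 * b for pj, b in zip(p, m2)) + 2 * sum(
        p[j] * p[k] * m1[j] * m1[k] for j, k in combinations(range(len(p)), 2)
    )
    big_m3 = 3 * big_m1 * big_m2 - 2 * big_m1 ** 3 + sum(
        pj ** 3 * (c - 3 * a * b + 2 * a ** 3)
        for pj, a, b, c in zip(p, m1, m2, m3)
    )
    return big_m1, big_m2, big_m3

M1, M2, M3 = sum_moments([0.5, 1 / 3, 0.1, 1 / 15], [6, 3, 1, 1], [0, 0, 0, 0])
single = sum_moments([1.0], [2], [0])   # one subset: moments of Beta(3, 1)
```

A useful consistency check is that `M2 - M1**2` reproduces the variance obtained from the second central moments.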

Using the higher moments (which are relatively easy to compute), the table below gives the computed lower confidence limits corresponding to several higher order fits of Equation (32) to the density function for $\sum_j P_j R_j$, for the previous Example 1-b).

     Terms used                γ = 0.50     0.90       0.95
     a_0                       0.8253       0.7032     0.6640
     a_0, a_1                  0.8263       0.7027     0.6626
     a_0, a_1, a_2             0.82629      0.70301    0.66248
     a_0, a_1, a_2, a_3        0.82620      0.70310    0.66281

The example indicates that the zero order (only two moments used), or possibly the first order approximation may be adequate in general. The next example of the Bayesian method includes the case where there is a failure.


Example 3-a)

     n_1 = 6,   n_2 = 3,       n_3 = 1,   n_4 = 1
     P_1 = 0.5, P_2 = 0.33333, P_3 = 0.1, P_4 = 0.066667
     f_1 = 0,   f_2 = 1,       f_3 = 0,   f_4 = 0

     Terms used                γ = 0.50     0.90       0.95
     a_0                       0.7563       0.6261     0.5863
     a_0, a_1                  0.7560       0.6259     0.5867
     a_0, a_1, a_2             0.75552      0.62651    0.58815
     a_0, a_1, a_2, a_3        0.75536      0.62698    0.58873

Example 3-b) Same as Example 3-a) except

     f_1 = 0,   f_2 = 0,       f_3 = 0,   f_4 = 1

     Terms used                γ = 0.50     0.90       0.95
     a_0                       0.8016       0.6816     0.6439
     a_0, a_1                  0.80384      0.68020    0.64058
     a_0, a_1, a_2             0.80436      0.68045    0.64027
     a_0, a_1, a_2, a_3        0.80372      0.68115    0.64068

Both Examples 3-a) and 3-b) indicate that the first or even zeroth order approximation is adequate, except possibly for high confidence coefficients.

6.2.2.2 Conclusions

The Bayesian approach to determine lower confidence limits on software reliability, based upon the approximate solution developed by Wolf, affords a method for handling almost all situations: any number of subsets, sample sizes and failures. As indicated by comparing Examples 1-b) and 2-b), however, the Bayesian method may in some cases be conservative, i.e., yield lower confidence limits on reliability which are smaller than those obtained by the Neyman method; or may be optimistic, i.e., yield lower confidence limits which are larger than those obtained by the Neyman method.

Consequently, the Neyman method should be investigated, possibly only for two or three failures in total, in order to keep the analysis and programming manageable. Further comparison could be made with the Bayesian method, which could then be "corrected" to correspond to the Neyman method. Thus a more nearly optimum system of confidence limits could then be obtained from the easily calculated Bayesian confidence limits.
6.2.3 Effect of Software Error Removal

When an error is detected (e.g., by the occurrence of an execution failure and analysis of the code to determine the source of the failure), the failure probability P and the reliability R of the program p are changed. To investigate the nature of these changes, the combined effect of several errors will be derived. First, the failure probability $P_1$ for the effect of a single error may be represented as:

\[ P_1 = \sum_{i=1}^{N} p_i \, y_{1i} \]

where

     y_{1i} = 1, if error 1 causes an execution failure for input E_i,
            = 0, otherwise.

The execution variable $y_{1i}$ has been doubly subscripted to show that it represents the effect of error number 1. Similarly, the failure probability for two errors is:

\[ P_2 = \sum_{i=1}^{N} p_i \, (y_{1i} + y_{2i} - y_{1i} y_{2i}) \]

$y_{2i}$ is the execution variable for the effect of error number 2. The combined effect of two errors is not simply additive, for they both can affect the execution of p for the same inputs. The term $y_{1i} y_{2i}$ compensates for the inputs affected by both errors.

The expression for more than 2 errors can be developed by introducing a new indexed variable $z_{ji}$, which is defined recursively by:

\[ z_{1i} = y_{1i} \]

\[ z_{2i} = y_{1i} + y_{2i} - y_{1i} y_{2i} = 1 - (1 - y_{1i})(1 - y_{2i}) \]

\[ z_{ji} = z_{j-1,i} + y_{ji} - z_{j-1,i} \, y_{ji} = 1 - \prod_{k=1}^{j} (1 - y_{ki}) \]

The change in program failure probability when an error (assumed to be error j) is detected and corrected is:

\[ P_j - P_{j-1} = \sum_{i=1}^{N} p_i \, (z_{ji} - z_{j-1,i}) = \sum_{i=1}^{N} p_i \, y_{ji} (1 - z_{j-1,i}) = \sum_{i=1}^{N} p_i \, y_{ji} \prod_{k=1}^{j-1} (1 - y_{ki}) \]

In the development of this formula, it has been assumed that the errors in a program can be identified and enumerated.
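The combination rule can be illustrated with a small sketch. It assumes each error is described by the set of inputs for which it causes an execution failure, so that the combined indicator for several errors is 1 whenever at least one of them fails on that input; the operational profile, the error sets, and the function names below are hypothetical.

```python
def failure_probability(profile, error_sets):
    """Failure probability as the total probability of the union of failing inputs."""
    failing = set()
    for errs in error_sets:
        failing |= errs
    return sum(profile[e] for e in failing)

def failure_probability_recursive(profile, error_sets):
    """Same quantity via the recursion z <- z + y - z*y, applied per input."""
    z = {e: 0 for e in profile}
    for errs in error_sets:
        for e in profile:
            y = 1 if e in errs else 0
            z[e] = z[e] + y - z[e] * y
    return sum(profile[e] * z[e] for e in profile)

# Hypothetical profile over four inputs and two overlapping errors.
profile = {"E1": 0.4, "E2": 0.3, "E3": 0.2, "E4": 0.1}
error_1 = {"E1", "E2"}
error_2 = {"E2", "E3"}

p_one = failure_probability(profile, [error_1])
p_two = failure_probability(profile, [error_1, error_2])
p_two_rec = failure_probability_recursive(profile, [error_1, error_2])
```

In this hypothetical case the second error adds only the probability of input E3, since E2 already fails because of the first error; the overlap is not counted twice.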

The equation for $P_j - P_{j-1}$ can be solved for cases in which the right-hand side has a simple dependence on j. The simplest case is obtained by setting it equal to a constant, say a:

\[ P_j - P_{j-1} = a \]

and therefore

\[ P_j = aj + b \]

This case of constant change in failure probability corresponds to the case where all errors have approximately a constant effect, which may not be a good approximation to reality. It is the case used in the Shooman and Jelinski-Moranda models.

The case of a linear change can also be solved:

\[ P_j - P_{j-1} = cj + d \]

and if $P_0 = 0$,

\[ P_j = \frac{c}{2} j^2 + \Bigl( d + \frac{c}{2} \Bigr) j = \frac{c}{2} \, j (j + 1) + dj \]

It corresponds to a situation in which the error having the largest effect is detected and corrected first and the error having the least effect is detected last. This seems to be a better approximation to reality.

An important problem in developing reliable software has been the introduction of new errors in the process of correcting detected errors. An error is detected by observing an execution failure for a run with a specific input, $E_i$. The source of the failure is sought by examining the code. When an error is found, that error is corrected by changing some of the code, and the effect of the change is tested by running the program with input $E_i$. If the program executes correctly for input $E_i$, the error is assumed to be corrected; however, experience has shown that execution failures occurring later may be due to the change made in correcting the code to make the case with input $E_i$ run correctly.

Analysis of the correction process with the aid of concepts from MTSR has led to a test procedure which should reduce the occurrence of correction-induced errors. A correction is made by changing the code in one or more segments, a segment being a sequence of contiguous executable statements which contains no branch. The logic path executed by the test case $E_i$ is composed of a sequence of segments. The segments which were changed in the error correction process may also be part of other logic paths. The changes involved in the correction may also change the execution results of the other logic paths, and unless the effect of the changes on these other logic paths is taken into account in making the correction, the changes could cause execution failures. To guard against this happening, whenever an error is corrected, the changed code should be tested not only by the original test case but also by a test case chosen so that it causes another logic path containing the changed segments to be executed. If this is done, most of the errors introduced in correcting errors will be detected, the resultant correction should have a low probability of having introduced new errors, and correction of the error will, in fact, have increased the reliability of the program.

Correction of the error will, in some cases, involve a change in a branch statement such that the segment or segments to which the branch transfers execution is changed. If this is the case, the changed program will have a changed set of logic paths, and the partition of E into the input subsets associated with each logic path will also be changed.
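The retest rule described above can be sketched as a simple selection over logic paths. The sketch assumes each logic path is available as a list of segment identifiers; the path names, segment numbers, and the function name are hypothetical.

```python
def paths_to_retest(paths, changed_segments, original_path):
    """Return the other logic paths that contain at least one changed segment."""
    changed = set(changed_segments)
    return [
        name
        for name, segments in paths.items()
        if name != original_path and changed & set(segments)
    ]

# Hypothetical program: four logic paths over segments 1..6.
paths = {
    "L1": [1, 2, 4, 6],
    "L2": [1, 3, 4, 6],
    "L3": [1, 2, 5, 6],
    "L4": [1, 3, 5],
}
# A correction found through a test case on L1 changed segment 2.
candidates = paths_to_retest(paths, [2], "L1")
```

At least one test case would then be chosen from the input subset of one of the returned paths, in addition to rerunning the original test case.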

6.2.4  Use  of  Structural  Information  in  the  Analysis  of  Reliability 

The structure of a program p may be characterized by the logic paths $L_j$, the input subsets $G_j$ associated with each logic path, the segments $S_i$ of which the logic paths are composed, and the branch tests which cause transfers from one segment to another during the execution of a logic path.

Most errors are confined to a single segment; some are spread over several segments; and others involve one or more branches. Each error will, however, affect the execution of only a few logic paths, viz., those logic paths that contain the affected segments and/or branches. It will not affect the execution of those logic paths that do not contain the erroneous code. Thus the error will not be detected by test cases that do not cause the erroneous code to be executed. A necessary condition for the detection, by testing, of all the errors in a program is that all segments and branches be executed in the testing. Then all the components of the program (all the segments and all the branches, and therefore every executable statement) will have been exercised in the testing, and there will be no component that has not been subject to test. Satisfaction of this condition is, however, not sufficient to ensure detection of all errors, as shown in the discussion in Section 6.2.1.

If S is the set of segments, $S_i$ is an arbitrary segment, and m is the total number of segments in program p, i.e.,

\[ S = \{ S_i : i = 1, 2, \ldots, m \} \]

then execution of p with input $E_i$ from the input subset $G_j$ will cause execution of the segments in a subset of S defined by the logic path $L_j$. If L is the set of all logic paths and n is the number of logic paths, i.e.,

\[ L = \{ L_j : j = 1, 2, \ldots, n \} \]

then L includes all segments and executing all logic paths will guarantee execution of all segments; however, all segments are contained in subsets of L which, for most programs, have fewer members than L. A subset of L which contains the smallest number of $L_j$ that collectively contain all segments is called a set of "characteristic paths". For a given program, there may be more than one set of characteristic paths. The number of paths in such a set is usually sufficiently small that it is practical to identify them (e.g., with the aid of a test tool such as PACE) and execute all of them during testing.
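Finding a set of characteristic paths is a minimum covering problem, which can be sketched by exhaustive search for small programs. The sketch below is not how PACE works (the report does not say); it assumes each logic path is given as a set of segment identifiers, and the path data and function name are hypothetical.

```python
from itertools import combinations

def characteristic_paths(paths):
    """Return one smallest set of path names whose segments cover all segments.

    Exhaustive search over subsets of increasing size; practical only when
    the number of logic paths is small.
    """
    all_segments = set().union(*paths.values())
    names = sorted(paths)
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            covered = set().union(*(paths[name] for name in combo))
            if covered == all_segments:
                return list(combo)
    return []

# Hypothetical program with five segments and four logic paths.
paths = {
    "L1": {1, 2, 4},
    "L2": {1, 3, 4},
    "L3": {1, 2, 3, 4},
    "L4": {1, 5},
}
chosen = characteristic_paths(paths)
```

Because the search returns the first smallest cover it finds, other covers of the same size may exist, which matches the remark that a program may have more than one set of characteristic paths.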

Since each logic path corresponds to a set of segments, the set of segments corresponding to logic path $L_j$ will be contained in a subset $C_j$ of a set C of characteristic paths, and all the segments in $L_j$ will be exercised by the test cases that execute all members of $C_j$. Thus if all the logic paths in C have been executed in the testing, all logic paths will have had a degree of testing in the sense that all of their segments will have been exercised in the testing.

A branch is equivalent to an ordered pair of segments, $(S_i, S_j)$. Since only a few of the possible pairs of segments correspond to branches in most programs, the number of branches is usually substantially less than $m(m-1)/2$, the number of possible pairs of segments. Thus it is also practical to identify (with the aid of test tools such as PACE) and test all branches. Owing to the definition of a branch, executing all branches will guarantee that all segments have been executed. Thus, testing which executes all branches will have exercised all the segments and pairs of segments that occur in all logic paths; it will also have exercised a substantial number of the possible higher order (≥3) segment sequences. This explains the high reliability (low incidence of execution failures) in operational use of the programs which have been tested, with the aid of test tools, so that all branches have been exercised in the testing.

Routine A of Project 5, which was discussed in Section 6.1, has 10 segments (the 6 segments composed of assignment statements plus the 4 branch statements), 13 segment pairs, and 4 logic paths. In the case of this routine, the set of characteristic paths is equal to the set of logic paths, and the minimum number of test cases needed to exercise all segments also executes all segment pairs and all logic paths.

A second  routine,  "routine  B",  has  17  segments,  24  segment  pairs, 
and  9 logic  paths.  All  segments  are  contained  in  5 logic  paths  and  all 
segment  pairs  are  contained  in  6 logic  paths.  Thus  only  5 test  cases  are 
needed  to  exercise  all  segments  and  6 test  cases  are  needed  to  exercise 
all  segment  pairs,  but  9 test  cases  are  needed  to  exercise  all  logic  paths. 

6.2.5  Estimating  Software  Reliability  from  Test  Results 

The method of measuring the reliability of a program described in Section 5.2.4.2 involves defining an operational profile $p_i$ and making n runs with the inputs for the runs chosen at random according to the probability distribution $p_i$. In Section 6.2.2, alternative methods based on choosing samples in various ways were defined. Since testing consists of making a series of runs and examining the output of each run to determine whether or not an execution failure occurred, the question can be raised as to whether the data available from testing can be used to develop an estimate of the reliability of the program as of the end of the testing.

Testing differs significantly from the prescribed conditions for making a good reliability measurement. The runs are not made with inputs chosen at random. Test cases are generally chosen to find errors rapidly, based on the experience and intuition of the test team, the functional capabilities the program must have, or on the use of a test tool such as PACE. Thus the test cases do not usually form a representative sample of the inputs expected in operational use of the program. Additionally, after an execution failure is detected in the testing, the source of the failure is sought and, after the error is found, it is corrected; so the reliability of the program is changing during the testing.

Making  use  of  the  concepts  developed  in  the  preceding  sections,  the 
following  procedure  was  developed  to  provide  a rough  estimate  of  R from 
the  test  data. 

• Define the input set E.

• Analyze E into the G_j associated with logic paths.

• Define, for the anticipated operational usage, the P_j
for each G_j.

• Determine the G_j to which each test case belongs.

• Determine the segments and pairs of segments which were
exercised in the testing and those that were not
exercised.

• Compute for each j:

$$P'_j = a_j P_j$$

with a_j determined by the following rules:

    • 0.99, if more than one test case belongs to G_j,

    • 0.95, if only one test case belongs to G_j,

    • 0.90, if no test case belongs to G_j but all segments
      and segment pairs in L_j have been exercised in the
      testing,

    • 0.80, if all segments but not all segment pairs in L_j
      have been exercised in the testing,

    • 0.80 - 0.20 m, if m segments (1 <= m <= 4) of L_j have not
      been exercised in the testing, and

    • 0, if more than 4 segments of L_j have not been exercised
      in the testing.

• The sum

$$\hat{R} = \sum_{j=1}^{k} P'_j$$

where k is the number of logic paths, is then a rough estimate of R.

The values of the a_j given above were assigned intuitively, based on the
analysis performed in this study and on experience in testing. For
specific programs and specific testing, it may be possible to develop a
better assignment of values to the a_j's.

To  develop  a better  estimate  of  R,  a measurement  based  on  a suitable 
sampling  technique  should  be  performed. 
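
The procedure of this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout and all names (PathEvidence, weight, rough_reliability) are hypothetical, the weight for a path covered by more than one test case is taken to be 0.99 on the assumption that the weights decrease as the test evidence weakens, and the sample values of P_j are invented.

```python
from dataclasses import dataclass


@dataclass
class PathEvidence:
    """Test evidence for one logic path L_j (all field names are hypothetical)."""
    p: float                   # P_j: operational probability of the input subset G_j
    test_cases: int            # number of test cases belonging to G_j
    untested_segments: int     # segments of L_j never exercised in testing
    all_pairs_exercised: bool  # True if every segment pair of L_j was exercised


def weight(ev: PathEvidence) -> float:
    """Return a_j for one path, following the intuitive rules in the text."""
    if ev.test_cases > 1:
        return 0.99  # assumed top weight (see the lead-in)
    if ev.test_cases == 1:
        return 0.95
    if ev.untested_segments == 0:
        return 0.90 if ev.all_pairs_exercised else 0.80
    if ev.untested_segments <= 4:
        return round(0.80 - 0.20 * ev.untested_segments, 2)
    return 0.0


def rough_reliability(paths: list[PathEvidence]) -> float:
    """Sum a_j * P_j over all logic paths; the P_j must sum to 1."""
    if abs(sum(ev.p for ev in paths) - 1.0) > 1e-9:
        raise ValueError("the P_j must sum to 1 over all logic paths")
    return sum(weight(ev) * ev.p for ev in paths)


# Invented example: four logic paths with decreasing test evidence.
paths = [
    PathEvidence(p=0.50, test_cases=3, untested_segments=0, all_pairs_exercised=True),
    PathEvidence(p=0.30, test_cases=1, untested_segments=0, all_pairs_exercised=True),
    PathEvidence(p=0.15, test_cases=0, untested_segments=0, all_pairs_exercised=False),
    PathEvidence(p=0.05, test_cases=0, untested_segments=2, all_pairs_exercised=False),
]
print(round(rough_reliability(paths), 4))
```

The weights are kept in one small function so that a project with better information can substitute its own assignment of the a_j without touching the summation.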

6.2.6 Program Safety Effects

The definition of software reliability given in 6.2.4.1 is dependent
on the definition of the function F which the program is intended to
specify. F is usually defined in the program specification; however, the
specification itself may contain errors, so F is ultimately defined by the
physical problem the program is intended to solve. The reliability
definition can accommodate itself to any definition of F. A changed
definition of F changes the definition of an execution failure, but the
changed definition of an execution failure can be used in the other
formulas of the reliability theory without any change in the formulas.

This  same  approach  can  be  used  to  adapt  the  theory  to  deal  with  the 
safety  of  a program.  Such  a question  can  arise  in  the  guidance  program  in 
a missile,  where  an  error  could  cause  a missile  control  failure  resulting 
in  the  missile  striking  a populated  area.  Another  case  is  that  of  the 
process  control  computer  in  a chemical  process  plant,  where  a software 
error  could  lead  to  control  action  that  could  result  in  an  explosion  or  a 
fire. 

The safety condition S(E_i) needs to be expressed so that safe
execution of the program can be defined, e.g., as:

$$F'(E_i) \in S(E_i)$$

and this definition is used in place of the previous definition of correct
execution. Whether or not a premature termination or failure to terminate
is considered a safety violation will depend on the system behavior which
can result from them, and it must be determined for each such system.

With  this  definition  of  safe  execution  and  non-safe  execution,  all 
the  reliability  formulas  can  be  used  to  calculate  the  safety  of  a program. 
Note  that  unreliable  operation  of  a program  will  not  necessarily  lead  to 
safety  violations. 
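
The substitution described in this section can be illustrated with a short Python sketch. It assumes that reliability is the probability, under the operational profile, that an execution is acceptable, and it evaluates the same sum twice: once with "output equals the intended value" and once with a safety condition. The controller, its errors, the profile and the safety limit are all invented for the illustration.

```python
def weighted_success(profile, program, acceptable):
    """Probability, under the operational profile, that an execution is acceptable.

    profile maps each input E_i to its probability p_i; acceptable(x, y)
    decides whether output y is acceptable for input x.
    """
    return sum(p for x, p in profile.items() if acceptable(x, program(x)))


# Hypothetical controller: the intended function F doubles its input.
def intended(x):
    return 2 * x


def program(x):
    """F' as actually coded, with errors at x = 3 and x = 4."""
    return {3: 7, 4: 100}.get(x, 2 * x)


profile = {1: 0.4, 2: 0.3, 3: 0.2, 4: 0.1}

# Same formula, two definitions of an acceptable execution.
reliability = weighted_success(profile, program, lambda x, y: y == intended(x))
safety = weighted_success(profile, program, lambda x, y: y <= 10)  # assumed S(E_i)
print(round(reliability, 3), round(safety, 3))
```

In this invented case the error at x = 3 is an execution failure but not a safety violation, so the computed safety is higher than the computed reliability.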

6.3 Guidelines and Techniques to Minimize Error Introduction
During Software Development

It is generally agreed that one of the major contributors to the
large number of errors which are found in most computer programs is the
logical complexity of the program. This complexity is manifested in the
large number of logic paths, even in small programs, and in the high
interconnectivity of the paths. Dealing with the complexity has been
compounded by the fact that some of the paths indicated on a logic diagram
are unexecutable or "phantom paths"; i.e., they cannot be caused to execute
by any choice of input values; e.g., in the program, routine A, which was
analyzed in Section 6.1, 4 of the 8 indicated paths are phantom paths. Some
programs have been found to have more phantom paths than they have
executable or "real" ones. Thus the actual structure of a program tends to
be veiled in a web of phantom paths, which obscures the actual structure so
that it is not readily visible in the program text. Owing to the phantom
paths, the program text does not continually "cue" the programmer on the
actual structure as he writes the program. A programmer has to keep a "map"
of the structure in his mind as he is writing a program. If the program is
more than a few statements long, this can be quite difficult, and he will
make errors. Additionally, he cannot easily see the errors in the text he
has written; consequently, he does not catch the errors, and they remain in
the program until they are found in testing or show themselves in the form
of execution failures in operational use of the program.
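
The notion of a phantom path can be made concrete with a small Python sketch. The routine here is hypothetical (it is not routine A): it performs three branch tests in sequence, and the third test repeats the first. The logic diagram therefore shows 2^3 = 8 paths, but only those on which the first and third outcomes agree can be executed.

```python
from itertools import product


def branch_outcomes(a: bool, b: bool) -> tuple:
    """Outcomes of three sequential branch tests; the third re-tests `a`."""
    return (a, b, a)


# Every combination of outcomes that the logic diagram suggests.
structural = set(product([False, True], repeat=3))

# Combinations that some choice of inputs can actually produce.
executable = {branch_outcomes(a, b) for a, b in product([False, True], repeat=2)}

phantom = structural - executable
print(len(structural), len(executable), len(phantom))
```

For this invented routine half of the indicated paths turn out to be phantom paths, the same proportion reported above for routine A.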

It is therefore apparent that the number of errors could be reduced
if programs could be written in such a way that they contain no phantom
paths. To verify this, routine A was rewritten in the form:

      IF(GN.NE.0.) GOTO 10
      IF(CN.LT.CT) GOTO 30
      GOTO 20
   10 IF(CN.LT.TF) GOTO 30
   20 JE = JE + 1
      KI = JD
      KM = 2
      KR = 3
      KB = JA
      KE = OB
      JV = JV + KI * 1
      KG = 1
   30 RETURN
      END

This new form of routine A contains four paths, all of which are executable.
Both branches of each branch test - (GN.NE.0.), (CN.LT.CT), and (CN.LT.TF) -
can be taken irrespective of the outcome of the evaluation of previous
branch tests. Therefore the programmer can make local checks of the logic
in the program without having to trace the logic path back to its initial
point. The rewritten routine not only has a simpler and more visible
structure, it also has fewer executable statements, 13 as compared to 20
in its original form.
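
The claim that every path of the rewritten routine is executable can be checked mechanically. The Python sketch below mirrors only the branch structure of the listing, not its assignments; the third branch test is assumed to compare CN with TF, and the trial values are arbitrary.

```python
from itertools import product


def path_taken(gn: float, cn: float, ct: float, tf: float) -> str:
    """Mirror the branch structure of the rewritten routine and label the path."""
    if gn != 0.0:                    # IF(GN.NE.0.) GOTO 10
        if cn < tf:                  # statement 10, assumed to test CN against TF
            return "GN!=0, CN<TF: return"
        return "GN!=0, CN>=TF: update"
    if cn < ct:                      # IF(CN.LT.CT) GOTO 30
        return "GN==0, CN<CT: return"
    return "GN==0, CN>=CT: update"   # GOTO 20


# Try every combination of a few arbitrary values for the four variables.
values = [0.0, 1.0, 2.0]
executed = {path_taken(*v) for v in product(values, repeat=4)}
for label in sorted(executed):
    print(label)
```

All four labels appear, so no path of this structure is a phantom path; each branch test lies on its own path and does not repeat an earlier test.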

Several  other  routines  of  Project  5 were  analyzed  and  found  to  con- 
tain phantom  paths.  All  of  them  were  rewritten  in  a form  containing  no 
phantom  paths.  The  new  forms  of  the  routines  all  contained  fewer  executa- 
ble statements  than  their  original  forms.  Two  of  the  Project  5 routines 
were  found  to  contain  no  phantom  paths,  as  originally  written.  They  were 
principally  in-line  code.  Each  contained  only  one  branch  test. 

The interpretation of a program p as a specification of a function F
can also be used to aid in reducing error introduction during software
development. F defines the information processing problem which the
program p is intended to solve. It is defined by what are usually called
the "requirements" on the program. The representation of F by a collection
of functions F_j defined on sets G_j, each of which is associated with a
logic path L_j of p, establishes a relationship between the logic path
structure of the program and the requirements. Each F_j defines the portion
of the information processing task concerned with computing values from the
inputs in the set G_j; thus F_j defines the requirements on p for operating
on G_j. Each F_j can therefore be interpreted as a "functional requirement"
associated with a logic path L_j.

Expression of the requirements in the form (F_j, G_j) and association
of each functional requirement with a logic path can aid in the detection
of errors. This is particularly true if the program is written in a form
such that all its logic paths are executable. Then each functional
requirement can be checked against the code of its corresponding logic
path. In fact, a form of this type of testing is being performed on
Project 5. Software requirements are allocated to portions of the software
system and from these the functional capabilities of individual routines
may be determined. These functional capabilities then form the basis for
developing the routine level test cases which must also exercise all
segments. The importance of this test strategy and a more detailed
description of the techniques involved are presented in Section 4.8.
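
One way to picture requirements expressed as (F_j, G_j) pairs is the Python sketch below. It is a hypothetical illustration, not the Project 5 procedure: each requirement carries a membership test for its input subset and the function required on that subset, and a program is checked against each requirement separately so that a failure points at one logic path.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FunctionalRequirement:
    """One (F_j, G_j) pair; all names here are hypothetical."""
    name: str
    in_subset: Callable[[float], bool]   # membership test for G_j
    required: Callable[[float], float]   # F_j, the value required on G_j


def violated(program, requirements, samples, tol=1e-9):
    """Return the names of requirements the program fails on the sample inputs."""
    failed = []
    for req in requirements:
        for x in samples:
            if req.in_subset(x) and abs(program(x) - req.required(x)) > tol:
                failed.append(req.name)
                break
    return failed


# Invented example: an absolute-value routine with an error on one path.
requirements = [
    FunctionalRequirement("non-negative input", lambda x: x >= 0, lambda x: x),
    FunctionalRequirement("negative input", lambda x: x < 0, lambda x: -x),
]


def buggy_abs(x: float) -> float:
    return x if x >= 0 else x   # error: the negative path should return -x


print(violated(buggy_abs, requirements, samples=[-2.0, 0.0, 3.0]))
```

Because each requirement is tied to one input subset, the report names the path whose code should be inspected rather than only stating that the routine failed.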

6.4  Improved  Methods  of  Software  Testing 

A fundamental problem in software testing is how to choose test cases
so that from their correct execution one can infer that the program tested
will execute correctly for almost all inputs. Traditionally, software
testing has concentrated on verifying that the functional capabilities of
the program are the ones it was intended to have. This has involved
preparing test cases to demonstrate that the program has certain defined
functional characteristics. Additional test cases were then generated, based
on the experience and intuition of the test team, until scheduled test time
and/or budget ran out. In spite of large amounts of money spent in testing
in this manner, the results were generally unsatisfactory, for the tested
software tended to have a high incidence of execution failures in
operational use.

The development of test tools, such as the PACE test tool of TRW,
shifted the focus of testing to assuring that the test cases exercised all
structural elements - i.e., all segments or branches - of the program.
Usage of these tools effected a dramatic reduction in the number of
execution failures in software tested with them (Section 4.0).

MTSR explains why these test tools work as well as they do and it
shows how to use them more effectively. The test tools analyze program
code into segments and branches. They instrument the program so that,
during test execution, the frequency of segment (or branch) exercising is
recorded. Those segments (or branches) not exercised are identified and
data is provided to aid in developing test cases to exercise them. In
Sections 6.2.1, 6.2.3 and 6.2.5, it was shown that correct execution of a
program for one test case can be used as a basis for inference that the
program will execute correctly for almost all inputs in the subset G_j to
which the test case belongs. Correct execution of test cases which exercise
all segments and branches was shown to remove the errors having the largest
effect (number of inputs affected), to provide an effective test of a set
of characteristic paths, and to provide a degree of testing (exercising of
their segments and branches) of all other logic paths. Any remaining
errors will, in most cases, each affect a relatively small number of inputs.
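
The instrumentation idea can be sketched in Python as follows. This is not the PACE tool; it is a minimal, hypothetical recorder in which probes are inserted by hand at the start of each segment, so that the exercise counts and the unexercised segments can be listed after a set of test runs.

```python
from collections import Counter


class SegmentRecorder:
    """Counts how often each named segment is exercised during test runs."""

    def __init__(self, segments):
        self.segments = list(segments)
        self.counts = Counter()

    def mark(self, segment: str) -> None:
        self.counts[segment] += 1

    def unexercised(self) -> list:
        return [s for s in self.segments if self.counts[s] == 0]


def classify(x: float, rec: SegmentRecorder) -> str:
    """Hypothetical routine with a probe at the start of each segment."""
    rec.mark("S1")
    if x < 0:
        rec.mark("S2")
        return "negative"
    if x == 0:
        rec.mark("S3")
        return "zero"
    rec.mark("S4")
    return "positive"


rec = SegmentRecorder(["S1", "S2", "S3", "S4"])
for test_input in (5.0, -1.0, 12.0):
    classify(test_input, rec)
print(dict(rec.counts), rec.unexercised())
```

The list of unexercised segments is the data a tester needs next: here it shows that no test case supplies a zero input.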

In the discussion in Section 6.2.5, it was noted that most of the
errors introduced in correcting a detected error can be detected by executing
the program with inputs chosen such that they exercise the code changed in
the error correction process and they exercise a different logic path than
the one exercised by the test case that detected the original error. It
turns out that, for many programs, the set of test cases that exercise all
segments will also result in most segments being exercised by two or more
test cases. However, investigation of what is done in actual testing has
shown that, as each error is detected, it is corrected, but that, at the
end of testing, it is rare that the program is run with all of the test
cases to verify that with all of the error corrections performed it
executes correctly for all test cases.

Based on the concepts developed in MTSR, the following test strategy
is proposed:

• Test the program until all segments and branches have been
exercised and the program executes correctly for all test cases.

• For each error corrected, all segments and branches modified by
the correction should be tested (if possible) by at least two
test cases which exercise the modified segments and branches
and which cause execution of different logic paths.

• Identify the input data subset G_j, logic path L_j, and functional
requirement F_j for each test case.

• Identify any functional requirements not directly verified by
the test cases and develop test cases for them.

• Check the code for occurrences of all known types of errors -
e.g., divide by zero, branch test boundary errors, variable name
misspelling, etc.

• Reduce residual errors remaining after the preceding testing by
choosing a sample of inputs at random from the operational
profile.

The first element is a necessary condition for the testing to detect
all errors. The second solves, to a large degree, the problem of errors
introduced in correcting errors. It requires only a few more, if any,
test cases than those used in the first element. The third and fourth
elements assure that the testing verifies that the functional requirements
are met, a necessary ingredient in any practical test process. The fifth
element catches known error types. The sixth reduces residual errors and
also provides a measurement of the reliability of a program.
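
The sixth element, random sampling from the operational profile, can be sketched in Python as shown here. The sketch assumes that the profile is a finite table of inputs with probabilities and that an independent oracle supplies the intended output; the profile, the oracle and the faulty program are invented. The fraction of sampled runs that execute correctly serves as the reliability measurement, and the failing inputs are returned so that the errors behind them can be found and corrected.

```python
import random


def sample_reliability(program, oracle, profile, n, seed=0):
    """Run n inputs drawn at random from the profile.

    Returns (fraction of correct runs, sorted list of distinct failing inputs).
    """
    rng = random.Random(seed)
    inputs = list(profile)
    weights = [profile[x] for x in inputs]
    failures = [x for x in rng.choices(inputs, weights=weights, k=n)
                if program(x) != oracle(x)]
    return 1.0 - len(failures) / n, sorted(set(failures))


# Invented example: the program is wrong only for a rarely used input.
profile = {0: 0.70, 1: 0.25, 2: 0.05}


def oracle(x):
    return x * x


def faulty(x):
    return 5 if x == 2 else x * x


estimate, failing_inputs = sample_reliability(faulty, oracle, profile, n=1000)
print(estimate, failing_inputs)
```

Because the inputs are drawn with the probabilities of operational use, an error that affects only rarely used inputs lowers the estimate only slightly, which is the behavior wanted of a reliability measurement.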

7.0 DATA COLLECTION

Learning how to collect software reliability data is an iterative
process. Each time we try to use data subsequent to its collection there
is the familiar sound of palm hitting forehead and the accompanying oath
concerning "the next time." Early planning produces long lists of
information to be collected, but not until the analysis begins does the
list of desired yet unavailable parameters start to grow. The data are all
there as a by-product of the software development process; the trick is in
knowing what to collect and, generally more important, when to collect it.
The approach often taken is to collect anything that isn't nailed down. This
approach works fairly well for parameters that are purely descriptive
(routine size) and poorly for parameters which result from evaluation
(resources required in closing a problem).

In this section we will briefly discuss data collection, recounting
some  of  the  lessons  learned  and  problems  identified  while  performing  the 
Software  Reliability  Study.  Some  suggestions  for  improvement  are  also 
presented. 

7.1  Observed  Realities 

Software projects have the potential for creating a tremendous amount
of data, and as mentioned above, these data are largely a by-product of the
software development process. Even though the duration of this study
allowed a certain amount of planning and implementation of data collection
procedures to suit our purposes, availability of data resulted principally
from project activities other than reliability oriented studies.
Specifically, data used in the SRS was the result of rigorous configuration
management and quality assurance practices. The reliability data
collection was necessarily done on a non-interference basis. The point to be
made, here, is that mechanisms for collecting software data already exist
in most disciplined software development projects. This can easily be
seen, for example, in the records resulting from configuration control
during testing. These are the problem reports, the test execution logs,
and the records of updates to the software configuration generated in an
attempt to control testing. But testing is only one phase, usually the
last one, in the software development process. What about the rest of
the development cycle? One of the first wishes of the analyst trying to
explain test data is to know what went on prior to testing.

Figure 7-1 illustrates a typical software development project by
phase and points out when various types of data can become available.
Ovals indicate error or problem data that can be produced and collected.
These are the data most useful in the study of software quality and
reliability. They form a continuum of problems and solutions starting as
early as the requirements specification phase and extending into the
operational portion of the system life cycle. Ancillary data, shown below
the phases of the development cycle, become available roughly at the times
indicated, are absolutely necessary to the understanding of the error
histories, and can be extremely useful in assessing on-going project
performance. Triangles along the base of Figure 7-1 denote points at which
snapshots of the software structural characteristics can be taken to
gain a picture of the volume of change.


There is the implication in Figure 7-1 that problem reports can be
written during virtually every phase of the development cycle. This
implication is intended, with one exception - problem reporting during the
coding and debug phase, when the programmers are attempting to create an
error-free compilation. During this phase manual data collection techniques
can represent a counter-productive, extra task to perform.*

Also  implied  in  Figure  7-1  is  the  fact  that  the  mechanism  for  collect- 
ing data,  including  manpower,  procedures,  and  tools,  must  exist  prior  to 
the  appropriate  phase  in  order  to  capitalize  on  the  freshness  (and  accuracy) 
of the data. In the case of software requirements problems this may mean
having  the  collection  mechanism  ready  prior  to  contract  go-ahead. 

7.1.1 Data Collection Techniques

A project's ability to provide data will depend largely on the data
collection techniques it can apply. Experience has shown that manual
techniques tend to be the easiest to implement in the sense that they are
readily adjustable to changing project demands. However, they require
manpower committed to the task of collection. Automated techniques ease
the pain of the collection task but do not necessarily reduce the amount
of work involved since they increase the scope of the job, making possible
work that was previously not possible through manual techniques. Automated
techniques also tend to be less flexible in response to changes in project
demands. An additional aspect is the cost of implementation. Not all
projects can support full blown data collection tasks.

In conjunction with its configuration management and quality
assurance activities TRW has applied both manual and automated data
collection, storage, and reporting techniques. The earliest attempts to
collect data were entirely manual and centered around software problem
reports (SPR) written during testing. These reports were written at the
routine level, if possible, and remained "open" until a second form
documenting the corrective action "closed" the problem. This second form,
the software modification record (SMR), was actually a vehicle for
delivering an update to the software configuration. This two form approach
to software problem documentation has been and continues to be used
successfully by many software contractors.**

A second, equally successful software problem documentation scheme
is also employed on Project 5. A one-document form, the discrepancy
report (DR), is used to document the identification of a software problem
and, through update procedures, the closure of the problem. Examples of
these forms and other problem reporting forms are presented in Appendix B.


*However, it is during this phase that automated error data collection is
most easily accomplished through the compiler.

**A  similar  problem  closure  form  is  the  modification  transmittal  memorandum 
or  MTM  used  on  Projects  2 and  3. 

Both the one document and two document forms work as vehicles for
controlling changes to the software configuration. As vehicles for
software error data collection they also work - if certain steps are taken
to assure data accuracy and completeness. That is, it is possible to have a
description of the software problem (its symptoms), a description of the
change to the software (the fix), and evidence that the problem was fixed
(test output) and to still not know what the actual software error was.
This phenomenon was noted in a review of Project 3 software problem
reports. Although forms were adequate to cover configuration management
concerns, they failed to provide adequate data for thorough software
error analysis. The Project 5 form was designed specifically to provide
sufficient information about both concerns.
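
The distinction between a form that satisfies configuration management and one that also supports error analysis can be sketched as a record layout in Python. The field names below are hypothetical and are not taken from the Project 5 discrepancy report; the point is only that the error itself and its category are recorded alongside the symptoms, the fix and the test evidence.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProblemRecord:
    """Hypothetical one-document problem record (not the actual DR layout)."""
    report_id: str
    routine: str
    symptoms: str                            # description of the problem
    fix_description: Optional[str] = None    # change made to the software
    test_evidence: Optional[str] = None      # output showing the problem is fixed
    error_description: Optional[str] = None  # what the actual error was
    error_category: Optional[str] = None     # assigned by the fixer

    def closed_for_cm(self) -> bool:
        """Enough to control the change to the software configuration."""
        return bool(self.fix_description and self.test_evidence)

    def usable_for_error_analysis(self) -> bool:
        """Closure plus a statement of the error itself and its category."""
        return self.closed_for_cm() and bool(
            self.error_description and self.error_category)


record = ProblemRecord(
    report_id="DR-0001",               # invented identifier
    routine="ROUTINE A",
    symptoms="wrong counter value after update",
    fix_description="corrected branch test",
    test_evidence="test case 12 output attached",
)
print(record.closed_for_cm(), record.usable_for_error_analysis())
```

The invented record can be closed for configuration management purposes while still leaving the analyst without a statement of what the error was, which is the situation described above for Project 3.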

As may be seen in Figure 7-1, much more data exist than are
documented by various types of problem reports, and some are extremely
difficult or virtually impossible to collect with manual techniques. For
example, software structural characteristics are determined by examining
the source code and tabulating the occurrence of such structural elements
as loops, branch statements, etc. For very small software programs this
information can be obtained manually, but even for small software systems,
the job is too big and tedious to attempt using manual techniques.
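
As an illustration of why such tabulation is better automated, the Python sketch below counts a few structural elements in fixed-form Fortran source. It is a crude, hypothetical tabulator based on pattern matching, not the tool used in this study: it assumes one statement per line, comment lines marked in column 1, and no continuation lines. The sample fragment is invented.

```python
import re
from collections import Counter

# Patterns for a few structural elements (an optional statement label may precede).
PATTERNS = {
    "branch_tests": re.compile(r"^\s*(\d+\s+)?IF\s*\(", re.IGNORECASE),
    "gotos": re.compile(r"\bGO\s*TO\b", re.IGNORECASE),
    "loops": re.compile(r"^\s*(\d+\s+)?DO\s+\d+\s+\w+\s*=", re.IGNORECASE),
}


def tabulate(source: str) -> Counter:
    """Count statements and selected structural elements in fixed-form source."""
    counts = Counter()
    for line in source.splitlines():
        if not line.strip() or line[0] in "Cc*":
            continue  # skip blank lines and column-1 comment lines
        counts["statements"] += 1
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


SAMPLE = """\
C     HYPOTHETICAL FRAGMENT
      DO 10 I = 1, N
      IF(A(I).LT.0.) GOTO 10
      S = S + A(I)
   10 CONTINUE
      RETURN
      END
"""

print(dict(tabulate(SAMPLE)))
```

Run over every routine at each snapshot of the configuration, a tabulator of this kind would give the structural counts whose change over time is of interest.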

As far as data analysis goes, our experience in the course of this
study showed that having the data in machine readable form is imperative.
The volume of information handled and the number of approaches to analysis
typically needed to bring out the results make purely manual techniques
very ineffective.

7.1.2  Cost  and  Schedule  Impact  of  Data  Collection 

The actual collection of data represents a considerable amount of
work. When contributions are requested from project performers, this work
may be counter-productive to the development of the software product.
This work can be translated into both cost and, possibly, schedule impacts
to the project. Depending on the project, these impacts could represent
significant drawbacks.


In collecting data for the SRS only Project 5 afforded the opportunity
to tailor the data collection to study needs and to ask for real-time
assistance from project performers in categorizing software problems.
Aside from a healthy amount of grumbling about the added nuisance,
cooperation was very good. In the retrospective analysis of data from
Projects 2, 3, and 4, there was also very good cooperation from project
performers. Apparently, our limited data collection activities did not
adversely affect any of these projects because we were allowed to continue
our work.

The  only  attempt  made  to  assess  the  impact  of  our  data  collection 
activities  was  to  determine  how  much  time  it  took  to  assign  an  error 
category  to  a problem  report.  This  was  the  only  area  where  we  enlisted 
the  help  of  project  performers  on  a continuing  basis. 

It was determined that it takes an average of 1.2 minutes for a
knowledgeable*  analyst  to  assign  an  error  category  to  a software  problem 
report  if  he  does  it  long  after  the  problem  is  fixed.  An  unknowledgeable 
analyst  takes  an  average  of  1.6  minutes**  to  assign  the  error  category 
in  this  after-the-fact  fashion.  If  the  fixer  of  the  problem  assigns  the 
category  at  the  time  he  documents  the  fix,  the  time  is  considerably  less 
than  a minute  and  the  results  are  considerably  more  accurate. 

The  cumulative  time  just  to  categorize  the  software  problems  from 
Project  3 is  in  excess  of  60  manhours,  assuming  a one  minute  assignment 
time  by  a knowledgeable  programmer  when  the  problem  is  fresh  on  his  mind. 
On the other hand, as is pointed out in Section 7.3, this may be a
worthwhile project expenditure in terms of increased software quality and
potential  improvements  in  the  development  and  test  processes. 
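
The arithmetic behind these figures can be laid out in a short Python sketch. The function names are hypothetical. The only inputs taken from the text are the per-report times and the 60 manhour total; the count of 3600 reports is simply what 60 manhours at one minute each implies as a lower bound, not a number reported by the study.

```python
def categorization_hours(n_reports: float, minutes_per_report: float) -> float:
    """Total effort in manhours to categorize n_reports problem reports."""
    return n_reports * minutes_per_report / 60.0


def reports_implied(hours: float, minutes_per_report: float) -> float:
    """Number of reports that a given effort covers at a given rate."""
    return hours * 60.0 / minutes_per_report


# 60 manhours at one minute per report corresponds to 3600 reports.
n = reports_implied(60.0, 1.0)

# The same number of reports categorized after the fact.
knowledgeable = categorization_hours(n, 1.2)
unknowledgeable = categorization_hours(n, 1.6)
print(n, round(knowledgeable, 1), round(unknowledgeable, 1))
```

For that illustrative count, categorizing long after the fix would take roughly 72 manhours for a knowledgeable analyst and roughly 96 for an unknowledgeable one, in addition to being less accurate.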

Specific Data Collection and Analysis Problems


One of the principal problems encountered in analyzing Project 3 data
was identifying the software error. As mentioned in 7.1.1, it was possible
to have the SPR, the MTM, the source code fix, and results from test cases
demonstrating problem closure and to still not know exactly what the error
was. This is due to the fact that the MTM, as its name implies, was
designed chiefly to deliver or transmit modifications of software routines
to the master configuration. Although the MTM makes provision for
descriptions of the error and the fix, this form was used principally to
satisfy CM requirements, leaving the reliability analyst at the mercy of
the MTM's author, who is thinking only of closing the problem. The
collection forms require redesign to guarantee configuration management
control and reliability data as equal data collection objectives. Luckily,
these objectives are not incompatible, and collection techniques can be
tailored to satisfy both objectives. Such was the case with the Project 5
discrepancy report or DR.

Another major problem concerns availability of resources during the
data collection process. Many parameters are not considered for collection
by virtue of the fact that they are difficult to collect and, once
collected, they are not easily stored. For example, the amount of code
exercised by a test case is extremely difficult, and in most cases
impossible, to determine without the aid of automated tools; yet this
parameter is believed to be essential to determination of error freeness.
A series of software tools designed to automatically collect and store
certain reliability related parameters is essential.


*A  programmer  who  either  developed  the  software  or  is  responsible  for 
fixing  it. 

**These  times  include  time  required  for  familiarisation  with  the  list 
of  error  categories  and  assume  all  problem  documentation  is  available. 


Manpower resources are also essential to the collection process.
There is a tendency to reduce the scope of data collection to a level
necessary to support immediate contract and project obligations. If
reliability data collection is not one of the project's principal
objectives, the best of attempts is liable to be shelved in the first
schedule or manpower pinch.

The  next  problem  is  one  of  education.  Here  the  situation  is  similar 
to  that  experienced  in  the  early  days  of  hardware  reliability  data  collec- 
tion; initially  the  idea  was  a foreign  one,  while  today  data  is  collected 
as  a matter  of  course.  In  the  collection  of  data  for  Projects  2 and  3 it 
was  discovered  that  programmers  would  have  been  more  or  less  willing  to 
provide  detailed  information  about  errors  had  they  known  that  such  informa- 
tion was  of  value  and  had  they  been  given  guidelines  for  its  provision. 
Project  5 experience  bore  this  out. 

Related  to  this  problem  is  the  necessity  of  benchmarking  the  data, 
i.e.,  collecting  data  sets  that  are  compatible  with  one  another  in  an 
evolutionary  or  time  sense.  Many  parameters,  especially  the  software 
structural  characteristics,  change  over  the  span  of  the  design,  develop- 
ment, and  test  phases.  This  change  can  also  be  prevalent  during  the  opera- 
tional usage  of  software  systems.  Although  intuitively  desirable,  a con- 
siderable amount  of  planning  is  required  to  establish  all  the  required 
resources  and  identify  the  data  to  be  collected.  Depending  on  the  software 
and  the  project  development  or  operational  characteristics,  these  bench- 
marks may  be  planned  for  almost  any  plateau  in  the  software  evolution; 
typically,  they  can  be  established  subsequent  to  the  preliminary  design, 
prior  to  formal  testing,  and  at  delivery  for  operational  use. 

Any attempt to collect data quickly points to a long list of problems.
Such a list, based on SRS experience in collecting, storing and analysing
software data from source projects, is presented in Table 7-1. Although
this is a fairly comprehensive list, it is short in that it deals in
generalities. For instance, entry 4 mentions that certain valuable data
items are perishable. The solution to preserve this data is to collect
the data items as they become available. An example of this is the error
category that can be assigned to a report of a software problem. In a
closer look at the requirements for accurately providing these data we
find that the "who" of the collection is just as important as "when" the
data is collected. Typically, it is the fixer of the software problem
alone that can assign such a category, and he must do the collection and
associated analysis when the error is fresh in his mind. The problem is
further compounded when it is realized that some very successful projects,
owing to variations in project structure, do their data collection in a
time driven mode (data collected periodically in time) rather than in an
event driven mode (data collected as it is generated). Here we see that
the opportunity to collect perishable data depends on another problem, the
ability of a project to provide certain types of data (entry 16).

We  encountered  virtually  all  of  these  problems  during  performance 
of  the  SRS.  Our  solutions  to  these  problems  were,  admittedly,  not  always 
successful,  but  successful  solutions  are  possible  and  will  be  forthcoming 
as more of these studies are attempted in the software community.

Table 7-1. Data Collection and Analysis Problems


1. Projects, the software, and the data vary considerably and
are not describable in common terminology.

2.  Data  collection  can  represent  cost,  schedule,  and  manpower 
impediments  to  software  development  projects.  The  impact 
or  cost  considerations  of  data  collection,  although  real, 
are  not  fully  appreciated. 

3. Data collection is a lot of work. The tools and techniques
for collecting data are not available.

4. Certain data items are perishable and must be collected and
analyzed when they become available, not after the fact.

5. Performers, project management, and even the buyers of software
are sensitive about providing data that might be used
by external agencies to adversely evaluate the project.

6.  Some  projects  produce  data  that  are  classified. 

7.  Analysis  techniques  and  questions  to  ask  of  the  data  are  not 
well  known. 

8. There is no guarantee that data will be collected (i.e., no
requirement for projects to collect data).

9. Data accuracy is a chronic question.

10.  Analysis  is  often  incomplete  or  inaccurate  if  proper  communi- 
cation with  project  performers  is  not  established. 

11.  Contractor  and  customer  representatives  of  project  manage- 
ment are  not  aware  of  the  benefits  of  data  analysis  and 
therefore  tend  not  to  support  it. 

12. Project structure is generally not tailored to use available
data (i.e., the mechanism for analyzing data and folding
results back into the project is not provided).

13.  The  fervor  of  data  collection  inspires  data  gathering  that 
is  non-supportive  of  the  software  development  process. 

14.  Some  data  elements  require  protection  to  preserve  the  privacy 
of  the  contributor  (e.g.,  cost  data). 

15.  Data  collection  is  commonly  thought  to  be  "not  necessary" 
to  a properly  managed  project. 

16. Project organizational structure and resources vary, making
consistent, multi-project data collection questionable.

17.  Definition  of  which  parameters  are  needed  and  meaningful  to 
collect  is  in  its  infancy. 

18. Presently implemented data collection schemes often fail to
gather data in sufficient detail, making results of analysis
questionable.

7.2 Recommendations for Improvement of Data Collection

From  the  considerable  practical  experience  gained  in  collecting 
and  compiling  data  it  is  obvious  that  a major  change  in  philosophy  is 
needed to guarantee a successful data collection process. Once we are
committed  to  the  need  for  such  information  we  need  to  support  this  com- 
mitment on  all  fronts.  In  this  regard  the  following  recommendations  can 
be  made: 

• Problem  reports  and  closure  reports  should  be  problem  oriented, 
providing  the  symptoms  of  the  problem,  an  accurate  statement  of 
the  problem,  and  a detailed  description  of  the  necessary  fix. 

An  ideal  situation  would  be  total  separation  of  the  problem 
closure  and  the  vehicle  for  delivering  modifications  to  the 
software,  i.e.,  two  separate  formats,  one  to  document  closure 
and  explain  the  fix,  the  other  to  deliver  the  source  code  of 
the  fix. 

• Data  collection  and  analysis  should  be  preceded  by  standards 
specifying  procedures  and  formats.  These  procedures  and 
formats  must  satisfy  configuration  management  and  reliability 
data  collection  requirements. 

• All  project  performers  should  be  made  aware  of  the  objectives 
of  the  study.  The  necessity  of  their  making  the  error  and  the 
correction "perfectly clear" cannot be overstressed.

• Further work to generate reliability data collection tools is
necessary.  These  should  be  made  general  purpose  wherever 
possible. 

• To  guarantee  the  collection  process,  dedicated  manpower  should 
be  allocated  to  monitor  the  data  as  it  is  generated. 

• The  data  collection  must  be  started  early  in  the  software 
development  cycle;  the  pre-code  stage  of  development  is  rife 
with  information. 

Each of these points was addressed by the Project 5 discrepancy
reporting  system  which  was  maintained  and  monitored  by  a quality  assurance 
organization,  and  each  is  feasible  in  the  Project  5 environment.  Of 
particular  help  in  the  data  collection  process  was  the  involvement  of  the 
project  performers  who  were  guided  by  formal  project  standards. 

7.2.1  Meaningful  Data  Items 


The  "parameters  we  wish  we  had"  is  an  ever  growing  list.  This  is 
especially  true  when  the  object  of  analysis  is  problems  discovered  during 
testing.  They  reflect  all  the  technical,  managerial,  and  political  prob- 
lems that  preceded  the  test  phase.  A short  list  of  data  items  which 
appear  to  be  meaningful  to  the  analysis  of  software  problem  reports  is 
given  in  Table  7-2. 


Table 7-2. Information Useful in Analyzing Software Problem Reports


Parameter                    Description

Requirements and Design      Descriptive information concerning the
Problems                     type and frequency of problems encountered
                             during the requirements and design periods.

Problem Criticality          An assessment of the importance of the
                             problem to the success of software
                             operation or "mission" completion.

Test Stress                  Quantitative assessment of the amount
                             of code, number of segments, variation
                             of data conditions, etc. treated by each
                             test case in a test program.

CPU Time                     CPU time tied accurately to specific
                             development and test jobs.

Difficulty of a Problem      Relative difficulty encountered in
Closure                      closing a problem accompanied by a
                             reason.

Problem Independency         Identification of a problem that is
Factor                       introduced as a result of a fix to
                             another problem.

Coding Load Factor           Accurate count of number of lines of
                             code during a given period of time.

Before and After             Subjective assessment of difficulty to
Difficulty Ratings           design, code, test, document, and implement
                             a routine made at first definition
                             of the routine and, later, after the
                             routine reaches operational status.

Manhour Availability         Accurate account of number of developer,
                             test, customer personnel manhours.


Finally,  information  that  was  found  very  helpful  in  analyzing 
Project  3 data  was  the  textual  "related  experience"  information  which  was 
collected  to  explain  analyses  of  outliers  in  the  metric  studies  of  Sec- 
tion 4.0.  Developers  were  able  to  provide  much  useful  information  when 
asked to verbalize what made their software difficult or easy. Even though
this information was subjective, there was surprising commonality in
terminology between developers.*

7.2.2 The Data Set Concept


Data that can be collected during the development process may be
categorized generically into data types, e.g., design data, productivity
data, test data, etc. Within each of these generic categories are detailed
parameters which support various paths of analysis, allow decisions to be
made concerning project status, indicate software quality, etc. However,
the ability to provide data from each category will vary from project to
project.

A particularly attractive approach to data collection was made obvious
by attempts to compare projects in work done on the SRS. This approach
would require, first, that quantifiable descriptions of generic analyses be
defined and, second, that the minimal number of parameters sufficient to
support these descriptions be identified. The resulting minimal data sets
would be applicable to varying degrees in virtually any project; large
projects could supply data to many data sets, while smaller projects might
be limited to one or two data sets. Parameters would be consistent from
project to project, however.
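
The data set idea can be illustrated with a small Python sketch. The data
set names and parameter names below are hypothetical placeholders chosen for
illustration; the study does not define a specific catalog of minimal data
sets. The sketch only shows the mechanism: each generic analysis names the
parameters it needs, and a project supports whichever data sets it can fill
completely.

```python
# Hypothetical catalog: each minimal data set lists the parameters it requires.
MINIMAL_DATA_SETS = {
    "error_categorization": {"symptoms", "cause", "fix_description"},
    "error_criticality": {"criticality"},
    "test_thoroughness": {"test_case_id", "statements_exercised"},
}


def supportable_data_sets(available_parameters):
    """Return the names of data sets for which the project supplies every parameter."""
    available = set(available_parameters)
    return sorted(
        name for name, needed in MINIMAL_DATA_SETS.items() if needed <= available
    )
```

Because the parameter names are shared across projects, a small project that
supports a single data set still contributes data that are directly
comparable with the same data set from a large project.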

7.3 Benefits of Data Collection and Analysis

Assuming  that  information  is  collected  and  available,  what  can  we 
expect  to  gain  from  its  analysis?  The  long  range  benefits  are  the  most 
obvious.  We  will  learn,  for  example,  how  large  a software  module  can  be 
without affecting understandability and how small it can be before
partitioning problems are encountered.* Specific tools and techniques to
improve the development process will also be identified. These are all
things that can be applied in the future, to the next project. But what
about the near term payoff for the project supplying the data? Project
managers, when asked to contribute to the data collection and analysis
activities, will invariably ask two questions: (1) What is the impact of
such a study on my project? and (2) How will the study benefit my project?

The answer to the first question largely determines what the answer
to the second question will be. Done properly, manpower and schedule
relief would be provided for collection of data throughout the life of the
project; much of the useful data is perishable and must be collected as
it is created. Even if this is not done, project involvement is required
in the analysis because individual performers alone are able to provide
some of the data with sufficient accuracy (e.g., causative error data).

Access  to  project  experience  is  essential  to  accurate  interpretation, 
even  if  the  analysis  is  done  independently.  The  more  project  commitment 
there  is  to  such  a study,  the  greater  the  impact  on  project  resources  and 
the greater the yield of usable information. The extent of this
commitment must be determined for each individual project.


*For that one project.



The benefit to the on-going project comes from increased awareness of
problems and better control over the development process. This is
particularly true if the project is of sufficient duration to allow findings
to be folded back into the project in the form of improvements. For example,
results of analysis of Project 5 integration test data were used in a
briefing to project performers on improvement of test techniques. In this
briefing it was recommended that unit level test case preparation include
special emphasis on "what if" thinking in selecting data to exercise routine
logic. Put another way, this calls for the test designer to ask what will
happen if various input data, including extremes, singularities, and
combinations of parameter settings, are presented to the software. Will the
test cases demonstrate the software's ability to handle these data
scenarios? This recommendation was in response to the most common logical
error, missing logic or condition tests.* Incidentally, this feedback of
results to project performers also provides an excellent opportunity to
point out positive findings and specific successes, as well as areas
needing improvement.
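
One way to make "what if" thinking mechanical is to enumerate extremes,
singular points, and their combinations for each input of a routine. The
Python sketch below is an illustration under assumptions, not the procedure
used on Project 5: the function names, the choice of low, high, and midpoint
as candidate values, and the example inputs are all hypothetical.

```python
from itertools import product


def what_if_values(low, high, singularities=()):
    """Candidate values for one input: both extremes, the midpoint,
    and any known singular points that lie inside the range."""
    candidates = {low, high, (low + high) / 2, *singularities}
    return sorted(v for v in candidates if low <= v <= high)


def what_if_cases(ranges):
    """Cross the candidate values of every input to get all combinations.

    ranges maps an input name to a tuple (low, high, singularities).
    """
    names = sorted(ranges)
    value_lists = [what_if_values(*ranges[name]) for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]
```

For two inputs with three candidate values each, this yields nine test data
scenarios, including the case where both inputs sit on a singular point at
once, which is the kind of data condition whose handling logic is most often
missing.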

Another  benefit  of  data  collection  and  analysis  may  be  in  the  way 
project  performers  view  their  errors.  Several  Project  5 programmers 
admitted  that,  although  they  didn't  particularly  enjoy  categorizing  their 
own  errors,  this  task  helped  register  the  error  in  their  memories,  and 
they  tended  not  to  repeat  errors  of  the  same  type. 

In  summary,  data  collection  and  analysis  on  software  development  pro- 
jects has  been  beneficial  in  the  following  ways. 

• It  has  helped  us  to  better  understand  the  various  factors  and 
difficulties  characterizing  the  software  and  the  development 
process. 

• It has given us a benchmark for estimating certain parameters,
such as size and cost, during the proposal stage of new work.**

• It has shot down some and confirmed others of our intuitive notions
about software and its development. It has also given us some
surprises.

• It has given us a healthy respect for the difficulties of collecting
and interpreting data.

• Probably most important, it has demonstrated a potential for
benefit to the project supplying the data.


*One development group admitted use of this sort of common sense approach
not only in creating test cases but also in design. They then challenged
the authors to find a missing logic error in the integration test data.
None were found.

**Where this new work is similar to the projects studied.

Data collection is truly in its infancy. We've identified some
significant  problems  and  have  some  project-specific  solutions  for  these 
problems.  Overall  the  experience  has  been  a positive  one  and  the  benefits 
are  real. 


8.0 CONCLUSIONS AND RECOMMENDATIONS


Our  first  finding  on  the  Software  Reliability  Study  was  that  software 
projects  have  a potential  for  creating  tremendous  amounts  of  useful  data. 
And,  in  a disciplined  software  development  or  maintenance  environment, 
data  can  be  a by-product  of  all  project  phases  and  activities.  To  guaran- 
tee data  quality,  however,  the  collection  process  should  be  event  driven, 
i.e.,  information  should  be  collected  as  events  create  it,  not  after  the 
fact.  This  means  that  the  collection  mechanism  must  be  planned  and  ready 
in  advance,  and  that  data  items  must  be  well  defined  prior  to  their 
collection. 

Analysis  of  error  data  showed  that  most  of  our  errors  were  design  and 
requirements  errors,  as  opposed  to  coding  errors  and  errors  made  during 
the  correction  of  other  errors.  Although  software  development  projects 
typically  expend  much  effort  in  requirements  and  design  reviews,  these 
sources  of  error  were  shown  to  represent  major  portions  of  the  total  errors 
detected  during  formal  testing.  For  one  project  it  was  also  shown  that 
the  ratio  of  requirements  and  design  errors  to  coding  errors  may  well 
decrease with project maturity in the top-down, multiple increment
development process. These results suggest that further detailed analysis
of requirements and design errors is necessary and that there is a need for
aids, either tools or techniques, to assist the developer in creating
both the requirements and the software design.

A detailed  look  at  error  types  showed  that  logic  errors  were,  percent- 
agewise, the  most  frequent  errors  in  each  of  the  projects  examined.  Of 
these  logic  errors,  the  most  frequent  error  was  the  one  where  logic  to 
handle a specific data condition was missing. Again, this supports the
need  for  requirements  and  design  aids  to  describe  the  need  and  the  design 
solution,  respectively.  Next,  in  order  of  occurrence,  were  the  data 
handling  or  computational  errors,  depending  on  the  type  of  software  being 
analyzed.  Data  base  errors  and  changes  due  to  tuning  also  represented  a 
significant  percentage  of  the  total  error  history. 

One  very  positive  outcome  of  this  study  was  the  similarity  in  frequency 
of  occurrence  of  various  error  categories  for  nonsimilar  projects. 

Although  comparison  of  projects  was  discouraged  in  this  study,  the  near 
equivalent  (i.e.,  standardized)  error  category  lists  used  in  the  analysis 
made  these  similarities  visible.  Standardized  data  description  and 
analysis  techniques  should  make  future  project  comparisons  possible  in 
a number of areas. Since many of the investigations pursued in this study
were  improved  by  grouping  the  software  by  function  (e.g.,  data  base 
management)  or  by  purpose  (e.g.,  computational)  it  is  recommended  that 
future work be aimed at standardizing these categories for project
comparison purposes.

Error categorization may be symptomatic or causative. To perform
categorization, three pieces of information are needed: the symptoms of
the problem, a description of the cause, and a description of the fix.

Two pieces of additional information are also needed: error criticality,
because all errors are not equal in a reliability sense, and the effort
required to correct each error. It is recommended that criticality be
keyed to "mission" success, where the following criteria dictate the
error's criticality.


Critical      - user's mission cannot be completed

High          - user's mission can be continued but performance
                is degraded

Medium        - workaround is available, performance not degraded

Low           - performance not affected

Noncritical   - product improvement

The amount of effort required to fix an error should take into account
1) time to find the error, 2) time to create a fix, and 3) time to test
the fix. It should be noted that collection of this information is not
trivial and implies availability of individuals qualified to provide
the information.*
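
The two additional data items could be recorded along the following lines.
This Python sketch is an assumption-laden illustration: the class names, the
numeric ordering of the criticality levels, and the use of hours as the unit
of effort are hypothetical; only the five level names and the three effort
components come from the text above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Criticality(IntEnum):
    """Criticality keyed to mission success (lower value = more severe)."""
    CRITICAL = 1      # user's mission cannot be completed
    HIGH = 2          # mission can be continued but performance is degraded
    MEDIUM = 3        # workaround is available, performance not degraded
    LOW = 4           # performance not affected
    NONCRITICAL = 5   # product improvement


@dataclass
class FixEffort:
    """Effort to correct one error, split into its three components (hours assumed)."""
    hours_to_find: float
    hours_to_create_fix: float
    hours_to_test_fix: float

    def total_hours(self) -> float:
        return self.hours_to_find + self.hours_to_create_fix + self.hours_to_test_fix
```

Keeping the three effort components separate, rather than recording a single
total, preserves the distinction between errors that are hard to find and
errors that are hard to fix.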

Although  investigations  of  the  effects  of  size  on  reliability  showed 
a good  linear  fit  for  preoperational  software,  indicating  no  preferable 
size to minimize errors, operational data show a tendency for large
routines,  say  >1000  total  source  statements,  to  be  more  error-prone  in 
the  operational  environment.  That  is,  it  is  possible  to  thoroughly  test 
and  remove  all  errors  from  the  smaller  routines. 

Standard measures of test thoroughness are also needed. Work in
connection with the Mathematical Theory of Software Reliability is
particularly attractive since this model treats not only a measure of
the amount of code exercised in a test but also variation of the input
data space and functional capabilities of the software.

*Criticality and effort to fix were sufficiently difficult to
collect accurately that they were not available for this study.
They were, however, identified repeatedly as information that
is needed.

Two important modifications to the standard regression analysis
were used in the course of the study of software attributes and metrics.
The first was the application of the constraint of non-negativity to all
influence coefficients of parameters assumed to exert a positive influence
on numbers of problems. This was used as a screening technique only,
since standard statistical analyses cannot be applied with these
constraints on the estimation process. The second modification was to
make another structural constraint on the linear regression model to
exclude undefined parameters; in other words, to eliminate the free
constant usually assumed in linear regression analyses. Another way of
stating the same thing is that the regression plane was forced through
the origin. This assumption appeared to be substantially borne out, in
that plots of number of problems versus each parameter tend to form a
"ray" from the origin. The imposition of this assumption on the regression
analysis estimation procedure usually results in a higher multiple
correlation coefficient than if the constant term is left in.
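
For a single parameter, the two modifications reduce to a simple
calculation, sketched below in Python. This is a simplified illustration
and not the study's estimation procedure: the study's regressions involved
several parameters at once, and the function names are hypothetical. With
one parameter x and problem count y, the through-origin model is y = b x,
the least-squares estimate is b = sum(x*y) / sum(x*x), and the
non-negativity constraint amounts to replacing a negative estimate with
zero.

```python
def fit_through_origin(xs, ys, non_negative=True):
    """Least-squares slope for y = b*x (no constant term).

    With non_negative=True, a negative estimate is replaced by 0.0, which
    is the constrained least-squares solution in the one-parameter case.
    """
    sxx = sum(x * x for x in xs)
    if sxx == 0:
        raise ValueError("all x values are zero; slope is undefined")
    slope = sum(x * y for x, y in zip(xs, ys)) / sxx
    if non_negative and slope < 0:
        slope = 0.0
    return slope


def fit_with_constant(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b) for comparison."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values are all equal; slope is undefined")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - slope * mean_x, slope
```

Comparing the two fits on the same data shows how much the free constant
contributes; where the data form a "ray" from the origin, the constant
estimated by the second function is close to zero.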

Use of test error data in suggesting improvements in development
and test techniques was shown to be possible. Similar studies using
data from other software development activities, such as requirements and
design reviews, should also be possible.

Finally, when work was started on the Software Reliability Study,
opinions from investigators (and from performers on projects supplying
the data) were mixed concerning whether the data would show anything at
all. The only precedent at the time was a study done for the CCIP-85
Study Group [5] which, for all of its data, was not very productive in
terms of final results. The skeptics pointed to this initial attempt
and predicted similar inconclusive results for this study. Yet, for
those S.R.S. investigators who worked on the projects supplying data to
both study projects, it seemed that the data had to be indicative of
what happened during the software development process and of the
reliability of the software itself.

One of the first things we learned in our study was that the earlier
investigators had developed useful techniques, laying the groundwork for
subsequent investigations. These techniques were their results. In
the S.R.S. we altered definitions, collected needed data not previously
available, and refined the techniques, but the basic approach to
investigation of empirical data was the one suggested by Reference [5]. It
was a valuable first step, and we didn't have to start from scratch. To
this we added techniques suggested by other investigators, again altering
and modifying where improvements were necessary. We conclude, then,
that the analysis of empirical data is in its infancy, and no single
"best" approach has yet been identified. We recommend that future
investigators not discard techniques that, for others, have proven
inconclusive. Rather, techniques should be considered as modifiable
candidates for future investigations. Those presented in this study
should be no exception.

APPENDIX A
GLOSSARY OF TERMS


CDR                  Critical Design Review (as described in MIL-STD-1521).

CM                   Configuration Management.

COMPOOL              An operating system capability which provides single
                     source data base definition and structure for
                     global variables.

Delivery             The point at which the software package is turned
                     over to a customer for use in the operational
                     environment.

Development Cycle    A series of tasks performed in serial order (for the
                     projects in this study) to create the deliverable
                     software product and its documentation. These tasks
                     consist of requirements analysis, preliminary and
                     detailed design, coding and checkout, and formal
                     testing. It is appropriately called a cycle because
                     it may be applied iteratively, as in the Project 5
                     top-down development approach. (See Section 7.0)

Direct Interface     An interface immediately between two software
                     elements.

                     [Routine 1] -- [Routine 2] -- [Routine 3]

                     In the example here Routine 1 interfaces directly
                     with Routine 2 and indirectly with Routine 3.

DPR                  Design Problem Report - used to document problems
                     in the design for preliminary and critical design
                     reviews.

DR                   Discrepancy Report - the Project 5 software problem
                     report.

DUT                  Document Update Transmittal.

FOC                  Final Operational Configuration.

Formal Testing       Testing conducted according to test procedures
                     which are documented and approved by contractor
                     and customer.

Function             A grouping of routines which performs a prescribed
                     function (structural elements of Projects 2 and 3).

Instructions         Machine instructions (machine dependent
                     parameter).

Internal Delivery    The point at which the software as an entire
                     package is given to the independent test
                     group.

IOC                  Initial Operational Configuration.

Metric               A measure of the extent or degree to which the
                     software possesses and exhibits a certain
                     characteristic, quality, property, or attribute.

MTM                  Modification Transmittal Memorandum (Projects 2
                     and 3).

MTSR                 Mathematical Theory of Software Reliability.

Operational          The status given a software package once it has
                     completed contractor testing and is turned over
                     to the eventual user for use in the applications
                     environment.

OS                   Operating System.

PA                   Product Assurance. An organizationally independent
                     group charged with specifying and enforcing adherence
                     to software quality standards in the areas of
                     requirements, design, coding, all phases of testing,
                     and management.

PDR                  Preliminary Design Review (as described in
                     MIL-STD-1521).

Routine              Smallest group of compilable code.

SMR                  Software Modification Record.

SPR                  Software Problem Report.

S.R.S.               Software Reliability Study.

S/S                  Subsystem.

Statements           Programming Language at the Source Code Level.

S/W                  Software.

User                 The individual at the man/machine interface who
                     is applying the software to the solution of a
                     problem, e.g., test or operations.


APPENDIX B

This appendix presents blank examples of data collection forms actually
used and forms similar to those used to collect error data for the Software
Reliability Study.

Three software problem reports are presented first. These are the
forms used to document software errors during testing. Both the one and
two piece of paper approaches, discussed in Section 7.0, are represented by
the DR and SPR, respectively.*

Used on Projects 2 and 3, the SPR (page B-2) opens the problem by
identifying the symptoms of the error, the test configuration, and, if
appropriate, a proposed solution. The problem remains open until it is
closed by a modification transmittal memorandum or MTM (page B-3) which
explains and delivers the fix. Hence the reference to a two piece of paper
approach.

The  discrepancy  report,  or  DR  (page  B-4),  is  used  in  an  approach 
where  one  form  documents  the  error  and  later  the  fix.  This  form  was  used 
on  Project  5. 
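
The two approaches differ mainly in how many records carry a problem from
opening to closure. The Python sketch below is a minimal illustration with
hypothetical class and field names; the actual forms carry many more fields
than are modeled here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SoftwareProblemReport:
    """First piece of paper: opens the problem."""
    number: str
    symptoms: str
    test_configuration: str
    proposed_solution: Optional[str] = None


@dataclass
class ModificationTransmittal:
    """Second piece of paper: explains and delivers the fix."""
    number: str
    closes_sprs: List[str]
    explanation_of_fix: str


@dataclass
class DiscrepancyReport:
    """Single form: documents the error and, later, the fix."""
    number: str
    error_description: str
    fix_description: Optional[str] = None

    @property
    def is_closed(self) -> bool:
        return self.fix_description is not None


def open_sprs(sprs: List[SoftwareProblemReport],
              mtms: List[ModificationTransmittal]) -> List[str]:
    """SPR numbers not yet closed by any modification transmittal."""
    closed = {number for mtm in mtms for number in mtm.closes_sprs}
    return [spr.number for spr in sprs if spr.number not in closed]
```

In the two piece approach, open status has to be derived by matching closure
records against problem records; in the one piece approach it is a property
of the single record.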


A third example of a software problem report is also given. This form
and its companion closure form, the Software Modification Record (SMR), are
presented on pages B-5 through B-12. Both were designed during the early
part of the Software Reliability Study for application on Project 5. The
project eventually selected as Project 5 was not the one for which the SPR/
SMR pair was developed, and since it already utilized the DR as a successful
problem reporting form, no attempt was made to impose the SPR/SMR pair.
Incidentally, the SPR/SMR pair was used successfully on the project for
which it was designed.


A sample of the Design Problem Report (DPR) is presented on
page B-13.


*Note:  SPR  - Software  Problem  Report 
DR  - Discrepancy  Report 
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DATE 
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PROBLEM  DESCRIPTION 
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Software  Problem  Report  (Continued) 


PARAMETER 
LOG  HO. 


LOG  DATE 
TIME 

TEST  PHASE 


STATUS 


KEY  PARAMETER  DESCRIPTION 

0 A unique  SPR  number  assigned  by 

Configuration  Management  (CM).  This 
number  may  have  an  alpha  prefix  to 
denote  additional  information,  (e.g., 
a development  prefix  to  distinguish 
these  SPRs  from  problems  documented 
during  a parallel  maintenance  phase 
of  an  earlier  version  of  the  same 
software.)  Numbers  are  sequential. 

0 The  date  the  problem  is  logged  by  CM. 

0 This  is  the  date  and  time  of  day  the 

problem  was  discovered* 

0 Test  phase  during  which  the  problem 

was  discovered. 

DEV  - development  test 
INTE6  - integration  test 
VAL  - validation  test 
ACC  - acceptance  test 
SITE  >•  operational  problem 

0 Status  is  a dynamic  indicator  maintained 

in  the  CM  records.  It  is  on  the  SPR 
for  use  by  developers  and  testers  in 
tracking  and  reporting  latest  status 
to  CM.  Codes  indicate  the  following: 

1 - The  Test  Working  Group  is 

reviewing  SPP.  to  determine 
appropriate  action. 

2 - SPR  has  been  assigned  to  a 

developer  for  correction. 

3 - Fix  available,  has  been  tested, 
and  is  ready  for  delivery  to 
master  program  library. 

Master  program  library  has  been 
updated;  retest  of  fix  on  master 
program  library  not  complete. 
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Software 

Problem 

Report  (Continued) 

wmmR 

KEY 

PARAMETER  DESCRIPTION 

STATUS  (Continued) 

5 - Test  rerun,  problem  still  exists, 
SMR  rejected. 

6 - Test  rerun,  fix  works,  SPR 
closed. 

7 - Hold  for  future  closure 

Problem  is  not  reproducible,  is 
product  improvement,  very  low 
priori ty,.  etc. 

ORIGINATOR/EXTENSION 

© 

Name  and  telephone  extension  of  author 
of  SPR. 

PROBLEM  WITH 

© 

Identification  of  whether  the  problem 
is  in  a routine,  the  date  base,  a 
document,  or  snap  combination  of  th;*se. 

ROUTINE/ELEMENT/SS 

Name  of  the  routine  exhibiting  the 
problem.  I*  the  routine  is  not  known, 
identify  the  element  or  subsystem 
(i.e.,  provide  the  maximum  amount  of 
detail  possible) 

MOD 

Modification  of  routine  exhibiting 
the  problem,  if  known. 

TAPE 

© 

Master  Program  Library  tape  l.D.  with 
the  offending  routine. 

DATA  3ASE 

© 

ID  of  data  base  used  when  the  oroblem 
was  discovered. 

DOCUMENTS 

© 

CDRL  number(s)  of  document(s)  exhibiting 
errors. 

TEST  CASE 

© 

ID  of  the  principal  test  case  which 
demonstrated  the  error. 

HARDWARE  UNIT 

ID  of  the  computer  system  being  operated 
when  the  problem  was  discovered. 
Software Problem Report (Continued)

PARAMETER  KEY  PARAMETER DESCRIPTION

PROBLEM DESCRIPTION/IMPACT

©

Detailed description of the symptoms
of the problem and, if possible, a
description of the actual problem.
Impact of the problem on future
testing, on interfacing software,
documentation, etc. should also be
provided.

NOTES

®

Working area for status keeping,
additional information, etc.
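
The STATUS codes in the Software Problem Report description above lend themselves to a small lookup structure. The following is a minimal Python sketch, not part of the original forms. The names `SprStatus`, `is_open`, and `count_open` are hypothetical. It assumes that the unnumbered entry between codes 3 and 5 is code 4, and it leaves out any codes below 2 because they are not described in this part of the list. Treating every status other than "closed" as open is also an assumption of the sketch.

```python
from enum import IntEnum


class SprStatus(IntEnum):
    """STATUS codes 2 through 7 as described in the SPR parameter list."""
    ASSIGNED = 2         # SPR assigned to a developer for correction
    FIX_READY = 3        # fix tested, ready for delivery to master program library
    LIBRARY_UPDATED = 4  # assumed number: library updated, retest not complete
    RETEST_FAILED = 5    # test rerun, problem still exists, SMR rejected
    CLOSED = 6           # test rerun, fix works, SPR closed
    HOLD = 7             # hold for future closure


def is_open(status):
    """Assumption of this sketch: only CLOSED counts as no longer open."""
    return SprStatus(status) is not SprStatus.CLOSED


def count_open(statuses):
    """Count the SPRs in a list of status codes that are still open."""
    return sum(1 for s in statuses if is_open(s))
```

For example, `count_open([2, 6, 7, 5])` counts the assigned, held, and retest-failed reports and skips the closed one.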
Software Modification Record (Continued)

PARAMETER  KEY  PARAMETER DESCRIPTION

LOG  NO 

@ 

A unique  SMR  number  assigned  by  CM 
upon  receipt  of  the  modification. 

LOG  DATE 

© 

Date  the  SMR  is  logged  by  CM. 

TIME 

© 

Time the SMR and the delivered
modification are available to CM
and logged.

ORIGINATOR 

© 

Author  of  the  modification  (generally 
the  closer  of  the  SPR(s)) 

SS 

© 

Subsystem  affected  by  the  modification 
being  delivered. 

ROUTINE 

© 

Name of the routine for which the
modification is being delivered.
(This field is a repeat of field (p).)

RESPONSE TO SPRs

©

SPR number(s) being totally or partially
closed by the delivery. If an SMR
partially closes a problem, a P is
appended to the SPR number (e.g., 1L'3A(P)).

RESPONSE  INCLUDES 

© 

Identification of elements in the delivery;
e.g., a routine modification, document
update or data base change or a
combination of these. If the SMR
effects closure of a problem with an
explanation this is indicated.

RESPONSE

©

Detailed description of the correction
being made to the software. In the
case of a document update or data
base change, the document update
transmittal (DUT) and data base change
request (DBCR) numbers, respectively,
are referenced with the description
of the necessary change in (?) and (S).

APPROVAL

Signature of appropriate manager.
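
The RESPONSE TO SPRs field described above marks a partial closure by appending a P in parentheses to the SPR number. The following is a minimal Python sketch of reading such a field. It is an illustration only: the names `SprReference`, `parse_spr_reference`, and `parse_response_to_sprs` are hypothetical, the SPR numbers used are made up, and the comma used to separate several SPR numbers is an assumption, since the form description does not state how multiple numbers are written.

```python
import re
from typing import List, NamedTuple


class SprReference(NamedTuple):
    number: str    # SPR number with any (P) suffix removed
    partial: bool  # True if the SMR only partially closes the SPR


# Matches a trailing "(P)", the partial-closure marker described in the text.
_PARTIAL_SUFFIX = re.compile(r"\(P\)$")


def parse_spr_reference(text: str) -> SprReference:
    """Split one entry such as '1002(P)' into its number and partial flag."""
    token = text.strip()
    partial = bool(_PARTIAL_SUFFIX.search(token))
    number = _PARTIAL_SUFFIX.sub("", token)
    return SprReference(number, partial)


def parse_response_to_sprs(field: str) -> List[SprReference]:
    """Parse a whole field; comma separation is an assumption of this sketch."""
    return [parse_spr_reference(t) for t in field.split(",") if t.strip()]
```

With these definitions, `parse_response_to_sprs("1001, 1002(P)")` yields one fully closed and one partially closed reference.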


Software Modification Record (Continued)

PARAMETER  KEY  PARAMETER DESCRIPTION

CODE TYPE:

®

Type(s) of source code
involved in the routine
modification.

I/O - input, output, or formatting
statements, etc.

COMP - computational code

LOGICAL - code that establishes
branches in the program.

DATA HAND - code that moves data
from place to place, stores
data, etc. That is, data
handling code.

ROUTINE

(L)

Name of routine being modified,
document being changed, or data
base being altered. In the case
of an explanatory SMR the name given
on the SPR is repeated.


OLD MOD
NEW MOD

Identification of old routine
modification to be altered to produce
the new routine modification.

REF DATA BASE
DBCR

If a data base change is in order,
this supplies the data base identifier
to which the changes delivered by
the data base change request are to
be applied.

REF DOCUMENT
DUT

If corrective action also requires
a change to a document, the title
of the document to be changed is
given and the document update
transmittal (DUT) delivering the
change is referenced by number.

HAS THE FIX BEEN TESTED?

(!)

Testing of a modification made to a
routine must be completed at appropriate
predetermined levels prior to delivery
of a modification. Element, subsystem,
integration, validation, and
operational testing, if applicable,
will be indicated. Remarks indicating
test success are invited.




Software Modification Record (Continued)

PARAMETER  PARAMETER DESCRIPTION

WAS PROBLEM CORRECTLY
STATED ON SPR?

Indication of accuracy of the
problem statement is to be given.
An accurate restatement of the
problem is to be given in the
remarks section.

PROBLEM SOURCE

Identification of the source of the
problem.

System Specification - requirements
document

Development Specification -
preliminary design document

Product Specification - detailed
design document

DB - data base

CODE - source code only

ESTIMATED RESOURCES

Resources required to close the
problem in manhours of work and
minutes of computer time.
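
The PROBLEM SOURCE categories and the ESTIMATED RESOURCES field described above can be combined to total the closure effort per source. The following is a minimal Python sketch under stated assumptions: the names `ProblemSource`, `SmrResources`, and `totals_by_source` are hypothetical, and the figures in the usage note are invented for illustration and are not data from the study.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class ProblemSource(Enum):
    """PROBLEM SOURCE categories listed on the SMR form."""
    SYSTEM_SPEC = "System Specification"            # requirements document
    DEVELOPMENT_SPEC = "Development Specification"  # preliminary design document
    PRODUCT_SPEC = "Product Specification"          # detailed design document
    DATA_BASE = "DB"                                # data base
    CODE = "CODE"                                   # source code only


@dataclass
class SmrResources:
    source: ProblemSource
    manhours: float          # manhours of work to close the problem
    computer_minutes: float  # minutes of computer time


def totals_by_source(records):
    """Sum manhours and computer minutes for each problem source."""
    totals = defaultdict(lambda: [0.0, 0.0])
    for r in records:
        totals[r.source][0] += r.manhours
        totals[r.source][1] += r.computer_minutes
    return {src: (hours, minutes) for src, (hours, minutes) in totals.items()}
```

For instance, two CODE records of 2.0 and 1.5 manhours with 10 and 5 computer minutes would be reported as a single CODE total of 3.5 manhours and 15 minutes.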


[Form residue: ACTION/CATEGORY; ACTION ITEM NO.; DISPOSITION: CLOSED]


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Rome  Air  Development  Center 


RADC plans and conducts research, exploratory and advanced
development programs in command, control, and communications
(C3) activities, and in the C3 areas of information sciences
and intelligence. The principal technical mission areas
are communications, electromagnetic guidance and control,
surveillance of ground and aerospace objects, intelligence
data collection and handling, information system technology,
ionospheric propagation, solid state sciences, microwave
physics and electronic reliability, maintainability and
compatibility.
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